{
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# Automated Text Generation & Data-Augmentation for Medicine, Finance, Law, and E-Commerce\n",
        "\n",
        "\n",
        "![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)\n",
        "\n",
        "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/webinars_conferences_etc/data_augmentation_and_text_generation_for_finance_legal_medical_and_ecommerce/data_augmentation_and_text_generation_tutorial.ipynb)\n",
        " \n"
      ],
      "metadata": {
        "id": "yU1vQvZxHf0t",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "v9mqPF18r65c",
        "outputId": "2298209e-e550-4c4e-f5be-d41c6bdb5650",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n",
            "Collecting nlu\n",
            "  Downloading nlu-4.0.0-py3-none-any.whl (570 kB)\n",
            "Collecting pyspark\n",
            "  Downloading pyspark-3.3.0.tar.gz (281.3 MB)\n",
            "Collecting spark-nlp<4.1.0,>=4.0.0\n",
            "  Downloading spark_nlp-4.0.2-py2.py3-none-any.whl (532 kB)\n",
            "Requirement already satisfied: pandas>=1.3.5 in /usr/local/lib/python3.7/dist-packages (from nlu) (1.3.5)\n",
            "Collecting dataclasses\n",
            "  Downloading dataclasses-0.6-py3-none-any.whl (14 kB)\n",
            "Requirement already satisfied: pyarrow>=0.16.0 in /usr/local/lib/python3.7/dist-packages (from nlu) (6.0.1)\n",
            "Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from nlu) (1.21.6)\n",
            "Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas>=1.3.5->nlu) (2.8.2)\n",
            "Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/dist-packages (from pandas>=1.3.5->nlu) (2022.2.1)\n",
            "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7.3->pandas>=1.3.5->nlu) (1.15.0)\n",
            "Collecting py4j==0.10.9.5\n",
            "  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)\n",
            "Building wheels for collected packages: pyspark\n",
            "  Building wheel for pyspark (setup.py) ... done\n",
            "  Created wheel for pyspark: filename=pyspark-3.3.0-py2.py3-none-any.whl size=281764026 sha256=26697f5a0fdab74315eebfd86064504a750fe4be2542166a156b9f599f8cd14c\n",
            "  Stored in directory: /root/.cache/pip/wheels/7a/8e/1b/f73a52650d2e5f337708d9f6a1750d451a7349a867f928b885\n",
            "Successfully built pyspark\n",
            "Installing collected packages: spark-nlp, py4j, dataclasses, pyspark, nlu\n",
            "Successfully installed dataclasses-0.6 nlu-4.0.0 py4j-0.10.9.5 pyspark-3.3.0 spark-nlp-4.0.2\n"
          ]
        }
      ],
      "source": [
        "! pip install nlu pyspark"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "import nlu \n",
        "\n",
        "gpt2_pipe = nlu.load('gpt2')\n",
        "gpt2_pipe"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "jPYOO36KlRQg",
        "outputId": "182c15a2-a038-4a75-9d19-472231453b44",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "gpt2 download started this may take some time.\n",
            "Approximate size to download 442.7 MB\n",
            "[OK!]\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "{'gpt2': GPT2TRANSFORMER_b38120f8fb6b,\n",
              " 'document_assembler': DocumentAssembler_de4915f4985b}"
            ]
          },
          "metadata": {},
          "execution_count": 1
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# Greedy decoding is the default, so predicting the same prompt twice returns the same text\n",
        "generation_result = gpt2_pipe.predict('My Favorite Food is!')\n",
        "print(generation_result['generated'].iloc[0])\n",
        "\n",
        "generation_result = gpt2_pipe.predict('My Favorite Food is!')\n",
        "print(generation_result['generated'].iloc[0])\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "jFtuw34_lRfQ",
        "outputId": "93718482-0adf-4e9c-8beb-73e784e30d00",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            " My Favorite Food is!\n",
            "\n",
            "I love the way the food is cooked. I love the texture. I like the way it's cooked. It's a little bit of a challenge to get the right amount of flavor out of it. I'm not\n",
            " My Favorite Food is!\n",
            "\n",
            "I love the way the food is cooked. I love the texture. I like the way it's cooked. It's a little bit of a challenge to get the right amount of flavor out of it. I'm not\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "gpt2_pipe.print_info()"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "JBBKxB1glRib",
        "outputId": "4ce41f98-db15-46f6-e74a-a9fa8b034154",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :\n",
            ">>> component_list['gpt2'] has settable params:\n",
            "component_list['gpt2'].setBatchSize(4)                         | Info: Size of every batch | Currently set to : 4\n",
            "component_list['gpt2'].setIgnoreTokenIds([])                   | Info: A list of token ids which are ignored in the decoder's output | Currently set to : []\n",
            "component_list['gpt2'].setRepetitionPenalty(1.0)               | Info: The parameter for repetition penalty. 1.0 means no penalty. See `this paper <https://arxiv.org/pdf/1909.05858.pdf>`__ for more details | Currently set to : 1.0\n",
            "component_list['gpt2'].setTask('')                             | Info: Transformer's task, e.g. 'is it true that'> | Currently set to : \n",
            "component_list['gpt2'].setTemperature(1.0)                     | Info: The value used to module the next token probabilities | Currently set to : 1.0\n",
            "component_list['gpt2'].setTopP(1.0)                            | Info: If set to float < 1, only the most probable tokens with probabilities that add up to ``top_p`` or higher are kept for generation | Currently set to : 1.0\n",
            "component_list['gpt2'].setMinOutputLength(10)                  | Info: Minimum length of the sequence to be generated | Currently set to : 10\n",
            "component_list['gpt2'].setMaxOutputLength(50)                  | Info: Maximum length of output text | Currently set to : 50\n",
            "component_list['gpt2'].setDoSample(False)                      | Info: Whether or not to use sampling; use greedy decoding otherwise | Currently set to : False\n",
            "component_list['gpt2'].setTopK(50)                             | Info: The number of highest probability vocabulary tokens to keep for top-k-filtering | Currently set to : 50\n",
            "component_list['gpt2'].setNoRepeatNgramSize(3)                 | Info: If set to int > 0, all ngrams of that size can only occur once | Currently set to : 3\n",
            ">>> component_list['document_assembler'] has settable params:\n",
            "component_list['document_assembler'].setCleanupMode('shrink')  | Info: possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full | Currently set to : shrink\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# DoSample is False by default, which makes generation deterministic; enable it to sample varied outputs\n",
        "gpt2_pipe['gpt2'].setDoSample(True)\n",
        "generation_result = gpt2_pipe.predict('My Favorite Food is!')\n",
        "print(generation_result['generated'].iloc[0])\n",
        "\n",
        "print('_'*50)\n",
        "\n",
        "generation_result = gpt2_pipe.predict('My Favorite Food is!')\n",
        "print(generation_result['generated'].iloc[0])\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "aN-lhLbXlRlx",
        "outputId": "9cbb83b7-3dbf-42f6-c13d-6d89e9ce45bc",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            " My Favorite Food is! If you love making things, this is it for you. Don't expect a dish to be super easy! I usually have my entire kitchen made by hand. You can definitely make them super easy by adding chopped almonds to the mix\n",
            "__________________________________________________\n",
            " My Favorite Food is!\n",
            "\n",
            "So what's all this about? First off, please don't say it's a food thing or anything. The reason is it's just food at a local restaurant. You can probably tell why this might surprise so many\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "7E9y7GPb7gPF",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [],
      "source": [
        "import pandas as pd\n",
        "\n",
        "def generate_with_gpt2(gpt_pipe, prompts, log=True):\n",
        "  # Accepts a single prompt (str) or a list of prompts and returns a DataFrame of results\n",
        "  if isinstance(prompts, list):\n",
        "    # Predict each prompt separately and stack the per-prompt DataFrames\n",
        "    df = pd.concat([gpt_pipe.predict(p) for p in prompts])\n",
        "  else:\n",
        "    df = gpt_pipe.predict(prompts)\n",
        "  if log:\n",
        "    print_generation_results(df)\n",
        "  return df\n",
        "\n",
        "def print_generation_results(df):\n",
        "  # Print every generated text under a numbered separator line\n",
        "  for idx, text in enumerate(df['generated']):\n",
        "    print(f'Example {idx}: {200*\"_\"}')\n",
        "    print(text)\n",
        "    print('\\n')\n"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Explore GPT-2 Parameters"
      ],
      "metadata": {
        "id": "yx7aegU8yaPa",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Sampling Methods\n",
        "\n",
        "\n",
        "Sampling means we **randomly** draw the next word from a probability distribution over words. \n",
        "To generate the next token, the model conditions this distribution on all previous tokens in the text. \n",
        "\n",
        "By default the distribution covers every word in GPT-2's vocabulary, and many of those words are poor candidates for the next token.\n",
        "\n",
        "There are two methods of reshaping and drawing from this distribution: \n",
        "\n",
        "1. **Top-K sampling**: Take the k most likely words from the original distribution. Redistribute the probability mass among those k words and draw according to the new probabilities.\n",
        "\n",
        "2. **Top-P (nucleus) sampling**: Take the smallest possible set of N words whose probabilities together add up to at least p. Redistribute the probability mass among those N words and draw according to the new probabilities.\n",
        "\n",
        "\n",
        "\n",
        "Additionally, both methods can be tweaked with the following parameter: \n",
        "\n",
        "- **temperature**: Parameter of the softmax function that reshapes the distribution computed by the model. The closer the temperature is to 0, the more deterministic generation becomes: the tails of the distribution get slimmer and the probabilities of outlier words move closer to 0. Temperature values closer to 1 make the tails fatter, which makes outliers more probable and generic results less probable. \n",
        "\n",
        "\n",
        "These parameters are shared by all methods: \n",
        "- **ignoreTokenIds**: A list of token ids which are ignored in the decoder's output (default: [])\n",
        "- **noRepeatNgramSize**: If set to int > 0, all ngrams of that size can only occur once \n",
        "- **repetitionPenalty**: The parameter for repetition penalty. 1.0 means no penalty. See [this paper](https://arxiv.org/pdf/1909.05858.pdf) for more details. \n",
        "- **task**: Transformer's task, e.g. `'is it true that'` (default: empty)"
      ],
      "metadata": {
        "id": "O8nZ-6Diyb8N",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
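    {
      "cell_type": "markdown",
      "source": [
        "The two filtering strategies above can be sketched in a few lines of plain Python. This is only an illustration on a made-up toy distribution, not how Spark NLP or GPT-2 implement sampling internally: the names `top_k_filter`, `top_p_filter`, `sample` and `toy_probs` are hypothetical helpers introduced here, and the words and probabilities are invented for the example."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Illustrative sketch only: top-k and top-p filtering over a toy next-word distribution.\n",
        "# All names and numbers below are assumptions made for this example.\n",
        "import random\n",
        "\n",
        "def top_k_filter(probs, k):\n",
        "  # Keep the k most likely words and renormalize so the kept probabilities sum to 1\n",
        "  kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]\n",
        "  total = sum(p for _, p in kept)\n",
        "  return {w: p / total for w, p in kept}\n",
        "\n",
        "def top_p_filter(probs, p):\n",
        "  # Keep the smallest set of most likely words whose cumulative probability reaches p\n",
        "  kept, cumulative = [], 0.0\n",
        "  for word, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):\n",
        "    kept.append((word, prob))\n",
        "    cumulative += prob\n",
        "    if cumulative >= p:\n",
        "      break\n",
        "  total = sum(q for _, q in kept)\n",
        "  return {w: q / total for w, q in kept}\n",
        "\n",
        "def sample(probs, rng):\n",
        "  # Draw one word according to the renormalized probabilities\n",
        "  words = list(probs)\n",
        "  return rng.choices(words, weights=[probs[w] for w in words], k=1)[0]\n",
        "\n",
        "# Made-up distribution for the word following 'My favorite food is'\n",
        "toy_probs = {'pizza': 0.5, 'pasta': 0.25, 'sushi': 0.125, 'rocks': 0.0625, 'the': 0.0625}\n",
        "rng = random.Random(0)\n",
        "\n",
        "print(top_k_filter(toy_probs, 2))    # only the 2 most likely words survive\n",
        "print(top_p_filter(toy_probs, 0.8))  # the 3 most likely words are needed to reach 0.8\n",
        "print(sample(top_k_filter(toy_probs, 2), rng))\n"
      ]
    },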
    {
      "cell_type": "markdown",
      "source": [
        "### Play with temperature \n",
        "Set the temperature higher to make GPT-2 more random and creative, at the cost of less coherent text.  \n",
        "Use values with `0 < temperature <= 1`.  \n",
        "You must call `gpt2_pipe['gpt2'].setDoSample(True)` to get non-deterministic results."
      ],
      "metadata": {
        "id": "lt7xX40lye4n",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
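    {
      "cell_type": "markdown",
      "source": [
        "To see why a lower temperature makes generation more predictable, here is a small sketch of a temperature-scaled softmax in plain Python. It is a toy illustration under stated assumptions, not the library's internal code: `softmax_with_temperature` and `toy_logits` are hypothetical names, and the logit values are invented."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Illustrative sketch only: how temperature reshapes a softmax distribution.\n",
        "# The function name and the logits are assumptions made for this example.\n",
        "import math\n",
        "\n",
        "def softmax_with_temperature(logits, temperature):\n",
        "  # Divide the logits by the temperature before the softmax; requires temperature > 0\n",
        "  scaled = [x / temperature for x in logits]\n",
        "  peak = max(scaled)  # subtract the maximum for numerical stability\n",
        "  exps = [math.exp(x - peak) for x in scaled]\n",
        "  total = sum(exps)\n",
        "  return [e / total for e in exps]\n",
        "\n",
        "toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate words\n",
        "for t in (1.0, 0.5, 0.1):\n",
        "  # Lower temperatures push more probability mass onto the top-scoring word\n",
        "  print(t, [round(p, 3) for p in softmax_with_temperature(toy_logits, t)])\n"
      ]
    },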
    {
      "cell_type": "code",
      "source": [
        "generate_with_gpt2(gpt2_pipe, 'Hello my name is Gpt2 I love to')"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 222
        },
        "id": "E_nnFk0TvQL1",
        "outputId": "47f5ae68-7c74-4d9c-bae8-c28df2016150",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Example 0: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Hello my name is Gpt2 I love to play! I love it, my wife plays me on the other team and we always play on the same team so I can play with any team from wherever I could be.\n",
            "\n",
            "\n",
            "The game started with\n",
            "\n",
            "\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "                          document  \\\n",
              "0  Hello my name is Gpt2 I love to   \n",
              "\n",
              "                                           generated  \n",
              "0   Hello my name is Gpt2 I love to play! I love ...  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-1026ab4a-2031-4ddb-a590-f458d4a23035\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>document</th>\n",
              "      <th>generated</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Hello my name is Gpt2 I love to</td>\n",
              "      <td>Hello my name is Gpt2 I love to play! I love ...</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 16
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "gpt2_pipe.print_info()"
      ],
      "metadata": {
        "id": "sRY5GYihvQUd",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "ff4134b4-0d0f-4e1e-f396-ddab2e5107e6",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :\n",
            ">>> component_list['gpt2'] has settable params:\n",
            "component_list['gpt2'].setBatchSize(4)                         | Info: Size of every batch | Currently set to : 4\n",
            "component_list['gpt2'].setIgnoreTokenIds([])                   | Info: A list of token ids which are ignored in the decoder's output | Currently set to : []\n",
            "component_list['gpt2'].setRepetitionPenalty(1.0)               | Info: The parameter for repetition penalty. 1.0 means no penalty. See `this paper <https://arxiv.org/pdf/1909.05858.pdf>`__ for more details | Currently set to : 1.0\n",
            "component_list['gpt2'].setTask('')                             | Info: Transformer's task, e.g. 'is it true that'> | Currently set to : \n",
            "component_list['gpt2'].setTemperature(1.0)                     | Info: The value used to module the next token probabilities | Currently set to : 1.0\n",
            "component_list['gpt2'].setTopP(1.0)                            | Info: If set to float < 1, only the most probable tokens with probabilities that add up to ``top_p`` or higher are kept for generation | Currently set to : 1.0\n",
            "component_list['gpt2'].setMinOutputLength(10)                  | Info: Minimum length of the sequence to be generated | Currently set to : 10\n",
            "component_list['gpt2'].setMaxOutputLength(50)                  | Info: Maximum length of output text | Currently set to : 50\n",
            "component_list['gpt2'].setDoSample(False)                      | Info: Whether or not to use sampling; use greedy decoding otherwise | Currently set to : False\n",
            "component_list['gpt2'].setTopK(50)                             | Info: The number of highest probability vocabulary tokens to keep for top-k-filtering | Currently set to : 50\n",
            "component_list['gpt2'].setNoRepeatNgramSize(3)                 | Info: If set to int > 0, all ngrams of that size can only occur once | Currently set to : 3\n",
            ">>> component_list['document_assembler'] has settable params:\n",
            "component_list['document_assembler'].setCleanupMode('shrink')  | Info: possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full | Currently set to : shrink\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "text = 'Hello my name is GPT2, I love to'\n",
        "data = [text,text,text]\n",
        "gpt2_pipe['gpt2'].setDoSample(True)\n",
        "gpt2_pipe['gpt2'].setTemperature(0.5)\n",
        "gpt2_pipe['gpt2'].setMaxOutputLength(100)\n",
        "generate_with_gpt2(gpt2_pipe, data)\n"
      ],
      "metadata": {
        "id": "HALLtW26vQgO",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 569
        },
        "outputId": "8b90b380-d47a-4ea8-8240-700d8a681040",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Example 0: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Hello my name is GPT2, I love to read, and I'm a huge fan of the internet. I'm also a huge geek. I love books and TV, and this is my first time reading a new book. I also love anime. I don't think I've ever read anything before, and yet I'm so excited for this. I've already read some of the manga and anime, and it's been amazing. I think I'll finally get to read a new series.\n",
            "\n",
            "\n",
            "Example 1: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Hello my name is GPT2, I love to play games, and I am looking forward to the next one.\n",
            "\n",
            "I've been playing Hearthstone for a long time and I'm really enjoying it. I've seen a lot of great things about it, and the community is really nice. I'm looking forward for the next release, and hopefully I can get it done as soon as possible. I would love to hear your feedback, so please feel free to leave a comment below, or tweet\n",
            "\n",
            "\n",
            "Example 2: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Hello my name is GPT2, I love to play with players. I love the games, I like the games and I love playing with people.\n",
            "\n",
            "GPT2 : What game do you play?\n",
            "\n",
            "Kris : I play a lot of games. I play the classic board game with my friends. I am a little bit of a nerd. I like to play a bit of board games.\n",
            ".\n",
            " (laughs)\n",
            "\n",
            ".\n",
            "\n",
            "\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "                           document  \\\n",
              "0  Hello my name is GPT2, I love to   \n",
              "0  Hello my name is GPT2, I love to   \n",
              "0  Hello my name is GPT2, I love to   \n",
              "\n",
              "                                           generated  \n",
              "0   Hello my name is GPT2, I love to read, and I'...  \n",
              "0   Hello my name is GPT2, I love to play games, ...  \n",
              "0   Hello my name is GPT2, I love to play with pl...  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-299dcc25-caae-42a9-8c61-9748efe23066\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>document</th>\n",
              "      <th>generated</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Hello my name is GPT2, I love to</td>\n",
              "      <td>Hello my name is GPT2, I love to read, and I'...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Hello my name is GPT2, I love to</td>\n",
              "      <td>Hello my name is GPT2, I love to play games, ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Hello my name is GPT2, I love to</td>\n",
              "      <td>Hello my name is GPT2, I love to play with pl...</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-299dcc25-caae-42a9-8c61-9748efe23066')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "        \n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "      \n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-299dcc25-caae-42a9-8c61-9748efe23066 button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-299dcc25-caae-42a9-8c61-9748efe23066');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 6
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "dfs = [] \n",
        "for temp in [1.0, 0.75, 0.5, 0.25, 0.01, 0000.1, ]:\n",
        "  print(f'{25*\"-\"} Generation Parameter Temperature {temp} {25*\"-\"}')\n",
        "  text = 'Hello my name is GPT2, I love to'\n",
        "  data = [text,text,text]\n",
        "  gpt2_pipe['gpt2'].setDoSample(True)\n",
        "  gpt2_pipe['gpt2'].setTemperature(temp)\n",
        "  gpt2_pipe['gpt2'].setMaxOutputLength(100)\n",
        "  generate_with_gpt2(gpt2_pipe, data)\n"
      ],
      "metadata": {
        "id": "OAEnc1NmvQjv",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "53b77328-e325-4e71-bc7c-59572527e0bf",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "------------------------- Generation Parameter Temperature 1.0 -------------------------\n",
            "Example 0: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Hello my name is GPT2, I love to play, the community I've grown in the past 10 years has had a hard time accepting it, but I feel that I'm doing it right.\n",
            "\n",
            "\n",
            "GPT2 is coming to PlayStation 4 for PS Vita and PS3 with a PS4 Pro title for release this fall. As per the request for the PS4 update, you can pre-order the beta version here: http://www.gpt2.com/about-us\n",
            "\n",
            "\n",
            "Example 1: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Hello my name is GPT2, I love to go to all-day hikes. I love seeing the landscape and learning new things but the most important to me is just having fun.\n",
            "\n",
            "So I'm going to write an eBook on how you can have a great day at the trailhead. The eBook features a beautiful print of your favorite park signage, trail signage, the trail, and of course, GPS info. If you choose a GPS location, I recommend you read the manual for it\n",
            "\n",
            "\n",
            "Example 2: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Hello my name is GPT2, I love to spend time and play video games since childhood. I play many games over the web that help me build good character by playing on consoles, however I still play on a console, so I keep on to many games that help my character make good ends. I like to play multiplayer with friends and when I live alone on my own or with others in an apartment I can use mobile games to give me the best experience by helping my other friends get to know\n",
            "\n",
            "\n",
            "------------------------- Generation Parameter Temperature 0.75 -------------------------\n",
            "Example 0: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Hello my name is GPT2, I love to read and write and I like to work on the web. I will continue to work and continue to share my knowledge and experience. Please give me your best wishes and I promise I will give you even more.\n",
            "\n",
            "For any questions or if you have any questions about our web site visit our FAQ page. And please keep in mind that we are only available to US residents and international customers.\n",
            "\n",
            "\n",
            "Example 1: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Hello my name is GPT2, I love to play with my friends, as well as find a balance between being able to play and not being able. I can't wait to see what I come up with for my next game!\n",
            "\n",
            "\n",
            "Example 2: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Hello my name is GPT2, I love to play and to play. And it's always fun to see the world. I'm just so happy to be back home again. I guess I will stay on.\n",
            "\n",
            "I like to think that my new home is very easy to navigate. And I like to see how I am doing on my way out. You know how I used to think I was a big person? I was just so small for so long. What I realized now as\n",
            "\n",
            "\n",
            "------------------------- Generation Parameter Temperature 0.5 -------------------------\n",
            "Example 0: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Hello my name is GPT2, I love to eat. I like to eat, I like being around people. I love being around other people.\n",
            "\n",
            "I'm not a big fan of the Internet, I don't like it. I don, I think it's a bad thing. I think there is no way that I can live my life without the Internet. It's a huge problem.\n",
            ".\n",
            " I'm not really a big user of the Web. I can't even go\n",
            "\n",
            "\n",
            "Example 1: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Hello my name is GPT2, I love to play with my friends and I love the game. I'm always looking for something new to play and I'm looking forward to playing with you guys.\n",
            "\n",
            "Hi, I'm GPT3, I play with a lot of players, it's my first time playing with a bunch of people and I'll be honest I don't think I'll ever play with anyone that's really good at it but I do like to play a lot. I\n",
            "\n",
            "\n",
            "Example 2: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Hello my name is GPT2, I love to play and I love the game. I want to play this game, so I'm going to play it. I am going to buy it.\n",
            "\n",
            "I am going on vacation from the US and I am on vacation. I have the best time. I'm trying to learn, I'm doing my best. I got my first Pokemon. I will play Pokemon.\n",
            " I am playing Pokemon. You can get money, you can buy Pokemon\n",
            "\n",
            "\n",
            "------------------------- Generation Parameter Temperature 0.25 -------------------------\n",
            "Example 0: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Hello my name is GPT2, I love to play with my friends and I'm looking forward to playing with you. I'm going to be playing with my friend and I want to share my experience with you guys. I want you to know that I'm not going to lie and say that I am not a fan of the game. I am going to tell you that I love the game and I am very happy with the way it is. I love it. I really do. I think\n",
            "\n",
            "\n",
            "Example 1: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Hello my name is GPT2, I love to play with my friends and I love playing with my family. I think it's a great game, I think I'm going to be a great player. I'm just going to have to see how I play. I don — I'm not going to play until I'm 100 percent. I'll be 100 percent.\"\n",
            " that's a good thing.\n",
            "For the record, I'm still not 100 percent, but I'm definitely 100 percent sure\n",
            "\n",
            "\n",
            "Example 2: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Hello my name is GPT2, I love to play games and I love my country. I'm a big fan of the game and I'm hoping to play it again soon. I'd like to play the game again. I'll be playing it again.\n",
            "\n",
            "I'm a huge fan of your game. I love your game and you're going to make it great.\n",
            " 5. I am a big gamer. I like to be around people. I play games. I enjoy playing games\n",
            "\n",
            "\n",
            "------------------------- Generation Parameter Temperature 0.01 -------------------------\n",
            "Example 0: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Hello my name is GPT2, I love to drive because even they cannot send orders which it pays 100 BTC out? Am pretty poor.. This also just works best out till some weird year we pay my friends/workers alot without finding back it ends being 30 000 so good bye mea time :( https0stcoiners' thread will send $.000000/week by August for their Bitcoin miner from bitcoinn\n",
            "5am1Xb Profile of 5 Posts Follow our business of not providing their\n",
            "\n",
            "\n",
            "Example 1: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Hello my name is GPT2, I love to drive because even they cannot send orders which it pays 100 BTC out? Am pretty poor.. This also just works best out till some weird year we pay my friends/workers alot without finding back it ends being 30 000 so good bye mea time :( https0stcoiners' thread will send $.000000/week by August for their Bitcoin miner from bitcoinn\n",
            "5am1Xb Profile of 5 Posts Follow our business of not providing their\n",
            "\n",
            "\n",
            "Example 2: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Hello my name is GPT2, I love to drive because even they cannot send orders which it pays 100 BTC out? Am pretty poor.. This also just works best out till some weird year we pay my friends/workers alot without finding back it ends being 30 000 so good bye mea time :( https0stcoiners' thread will send $.000000/week by August for their Bitcoin miner from bitcoinn\n",
            "5am1Xb Profile of 5 Posts Follow our business of not providing their\n",
            "\n",
            "\n",
            "------------------------- Generation Parameter Temperature 0.1 -------------------------\n",
            "Example 0: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Hello my name is GPT2, I love to drive because even they cannot send orders which it pays 100 BTC out? Am pretty poor.. This also just works best out of every reason at the moment the guys mentioned if buying stock now works good now we probably did pay you then pay then give 30%. However our position from Vodafreday not wanting to provide additional quotes etc (after several seconds then 5 points now if at once), are that possible anyway… Do you work either an active\n",
            "\n",
            "\n",
            "Example 1: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Hello my name is GPT2, I love to drive because even they cannot send orders which it pays 100 BTC out? Am pretty poor.. This also just works best out of every reason at the moment the guys mentioned if buying stock now works good now we probably did pay you then pay then give 30%. However our position from VISA could look crazy fast cause visa couldn do in other financial business I never send stuff after last but buy any big price down 2 in first 50 i'd definitely still\n",
            "\n",
            "\n",
            "Example 2: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Hello my name is GPT2, I love to drive because even they cannot send orders which it pays 100 BTC out? Am pretty poor.. This also just works best out of every reason at the moment the guys mentioned if buying stock now works good now we probably did pay you then pay then give 30%. However our position from VISA could look crazy fast cause visa couldn do in other financial business I never send stuff after last but buy any big price down 2 in first 50 i'd definitely still\n",
            "\n",
            "\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# GPT2 Text Generation Industry Use-Cases"
      ],
      "metadata": {
        "id": "6MPVhsvd-rRr",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Zero-Shot-Learning Applications & Examples\n",
        "\n",
        "The model has not been trained to predict any of the  classes/predictions we are giving it"
      ],
      "metadata": {
        "id": "OM7W6OgQiqlr",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Data2Text for finance assets\n",
        "\n",
        "We can use the zero-shot learning capabilities of GPT2 and give it a series of asset prices and a description for the performance.\n",
        "\n",
        "This enables us to automatically generate textual descriptions of asset peformance for human consumption\n",
        "\n",
        "No need to hire a Data Analyst if GPT2 can your data to you and your users 😉\n",
        "\n",
        "\n"
      ],
      "metadata": {
        "id": "vKaEOBMtHrFx",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "gpt2_pipe['gpt2'].setDoSample(True) \n",
        "# We want the model to be coherent and logical\n",
        "# So we decrease Temp, Top-P and TopK \n",
        "gpt2_pipe['gpt2'].setTemperature(0.2) \n",
        "gpt2_pipe['gpt2'].setTopP(0.5) \n",
        "gpt2_pipe['gpt2'].setTopK(20)  \n",
        "# The Model needs to repeat the previously occuring Tokens\n",
        "# So we set Penealty to 0 and increase max NGram Size\n",
        "gpt2_pipe['gpt2'].setRepetitionPenalty(0)  \n",
        "gpt2_pipe['gpt2'].setNoRepeatNgramSize(5)  \n",
        "\n",
        "# by default the Document Assebler removes any newlines\n",
        "# Keeping them improves generation result\n",
        "gpt2_pipe['document_assembler'].setCleanupMode('disabled')\n",
        "\n",
        "prompt = \\\n",
        "\"\"\"week1 price = [14,13,14,16,18,20,22] -> good\n",
        "week2 price = [20,18,15,13,10,9,8] -> bad\n",
        "week3 price = [10,15,13,19,25,30,50] -> good\n",
        "week4 price = [60,65,73,74,76,77,86] ->\"\"\"\n",
        "\n",
        "# Dynamically set MaxOutputLength, based on prompt we give it \n",
        "# By giving it more Generation space, we can have a summary after the prediction\n",
        "gpt2_pipe['gpt2'].setMaxOutputLength(len(prompt)+5)\n",
        "\n",
        "\n",
        "gpt2_pipe.print_info()\n",
        "print(gpt2_pipe.predict(prompt).generated[0])\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "CyTsQ5anJo1a",
        "outputId": "eeadb5c9-e176-4280-a6dc-d5065003b54f",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :\n",
            ">>> component_list['gpt2'] has settable params:\n",
            "component_list['gpt2'].setBatchSize(4)                           | Info: Size of every batch | Currently set to : 4\n",
            "component_list['gpt2'].setIgnoreTokenIds([])                     | Info: A list of token ids which are ignored in the decoder's output | Currently set to : []\n",
            "component_list['gpt2'].setRepetitionPenalty(0.0)                 | Info: The parameter for repetition penalty. 1.0 means no penalty. See `this paper <https://arxiv.org/pdf/1909.05858.pdf>`__ for more details | Currently set to : 0.0\n",
            "component_list['gpt2'].setTask('')                               | Info: Transformer's task, e.g. 'is it true that'> | Currently set to : \n",
            "component_list['gpt2'].setTemperature(0.2)                       | Info: The value used to module the next token probabilities | Currently set to : 0.2\n",
            "component_list['gpt2'].setTopP(0.5)                              | Info: If set to float < 1, only the most probable tokens with probabilities that add up to ``top_p`` or higher are kept for generation | Currently set to : 0.5\n",
            "component_list['gpt2'].setMinOutputLength(10)                    | Info: Minimum length of the sequence to be generated | Currently set to : 10\n",
            "component_list['gpt2'].setMaxOutputLength(176)                   | Info: Maximum length of output text | Currently set to : 176\n",
            "component_list['gpt2'].setDoSample(True)                         | Info: Whether or not to use sampling; use greedy decoding otherwise | Currently set to : True\n",
            "component_list['gpt2'].setTopK(20)                               | Info: The number of highest probability vocabulary tokens to keep for top-k-filtering | Currently set to : 20\n",
            "component_list['gpt2'].setNoRepeatNgramSize(5)                   | Info: If set to int > 0, all ngrams of that size can only occur once | Currently set to : 5\n",
            ">>> component_list['document_assembler'] has settable params:\n",
            "component_list['document_assembler'].setCleanupMode('disabled')  | Info: possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full | Currently set to : disabled\n",
            " week1 price = [14,13,14,16,18,20,22] -> good\n",
            "week2 price = [20,18,15,13,10,9,8] -> bad\n",
            "week3 price = [10,15,13,19,25,30,50] -> good\n",
            "week4 price = [60,65,73,74,76,77,86] -> great\n",
            "\n",
            "week5 pricesethelessweek6 prices1865 prices prices week65etheless651522156065181325257314 ->week week65741525 -> price7465week bad73222525 bad15168625 bad ->73etheless151413 bad18147460etheless7318 price74 bad -> price7313 price86152522141418week131413etheless prices1816 prices73 week\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Treatment Reccomendations for Medicine"
      ],
      "metadata": {
        "id": "fSiV5EULiSAa",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "gpt2_pipe['gpt2'].setDoSample(True) \n",
        "# We want the model to be coherent and logical\n",
        "# So we decrease Temp, Top-P and TopK \n",
        "gpt2_pipe['gpt2'].setTemperature(0.5) \n",
        "gpt2_pipe['gpt2'].setTopP(0.8) \n",
        "gpt2_pipe['gpt2'].setTopK(50)  \n",
        "# The Model needs to repeat the previously occuring TOkens\n",
        "# So we set Penealty to 0 \n",
        "gpt2_pipe['gpt2'].setRepetitionPenalty(1)  \n",
        "gpt2_pipe['gpt2'].setNoRepeatNgramSize(3)  \n",
        "\n",
        "# by default the Document Assebler removes any newlines\n",
        "# Keeping them improves generation result\n",
        "gpt2_pipe['document_assembler'].setCleanupMode('disabled')\n",
        "\n",
        "prompt = \\\n",
        "\"\"\"Cough -> Promethazine\n",
        "Headache -> Naproxen\n",
        "Nose Bleeding -> Tranexamic acid\n",
        "Diarrhea ->  \"\"\"\n",
        "\n",
        "# Dynamically set MaxOutputLength, based on prompt we give it \n",
        "# By giving it more Generation space, we can have a summary after the prediction\n",
        "gpt2_pipe['gpt2'].setMaxOutputLength(len(prompt)+15)\n",
        "\n",
        "\n",
        "gpt2_pipe.print_info()\n",
        "print(gpt2_pipe.predict(prompt).generated[0])\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "bb777533-0661-4aba-87ac-f32d7777bee6",
        "id": "2zbTCk5295BY",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :\n",
            ">>> component_list['gpt2'] has settable params:\n",
            "component_list['gpt2'].setBatchSize(4)                           | Info: Size of every batch | Currently set to : 4\n",
            "component_list['gpt2'].setIgnoreTokenIds([])                     | Info: A list of token ids which are ignored in the decoder's output | Currently set to : []\n",
            "component_list['gpt2'].setRepetitionPenalty(1.0)                 | Info: The parameter for repetition penalty. 1.0 means no penalty. See `this paper <https://arxiv.org/pdf/1909.05858.pdf>`__ for more details | Currently set to : 1.0\n",
            "component_list['gpt2'].setTask('')                               | Info: Transformer's task, e.g. 'is it true that'> | Currently set to : \n",
            "component_list['gpt2'].setTemperature(0.5)                       | Info: The value used to module the next token probabilities | Currently set to : 0.5\n",
            "component_list['gpt2'].setTopP(0.8)                              | Info: If set to float < 1, only the most probable tokens with probabilities that add up to ``top_p`` or higher are kept for generation | Currently set to : 0.8\n",
            "component_list['gpt2'].setMinOutputLength(10)                    | Info: Minimum length of the sequence to be generated | Currently set to : 10\n",
            "component_list['gpt2'].setMaxOutputLength(104)                   | Info: Maximum length of output text | Currently set to : 104\n",
            "component_list['gpt2'].setDoSample(True)                         | Info: Whether or not to use sampling; use greedy decoding otherwise | Currently set to : True\n",
            "component_list['gpt2'].setTopK(50)                               | Info: The number of highest probability vocabulary tokens to keep for top-k-filtering | Currently set to : 50\n",
            "component_list['gpt2'].setNoRepeatNgramSize(3)                   | Info: If set to int > 0, all ngrams of that size can only occur once | Currently set to : 3\n",
            ">>> component_list['document_assembler'] has settable params:\n",
            "component_list['document_assembler'].setCleanupMode('disabled')  | Info: possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full | Currently set to : disabled\n",
            " Cough -> Promethazine\n",
            "Headache -> Naproxen\n",
            "Nose Bleeding -> Tranexamic acid\n",
            "Diarrhea ->   Sodium chloride\n",
            "Nausea ->  Percussin\n",
            "Mouthache ->     Tinnitus\n",
            "Sleeping -> \n",
            "Rash -> \n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## SEO Text / Content Writing / Copy Writing\n",
        "Lets generate some texts for our new imaginary shopify store!\n",
        "We will sell some hoodies, soap and beard products.\n",
        "\n",
        "Based on some base templates, we can generate high quality marketing/seo texts\n",
        "\n",
        "![image](https://www.awai.com/_img/content/what-is-copywriting/title_page_image.png)"
      ],
      "metadata": {
        "id": "N_oo476g_bw6",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# from https://www.shopify.com/blog/8211159-9-simple-ways-to-write-product-descriptions-that-sell \n",
        "\n",
        "# Lets define some base texts, from which we will generate new texts\n",
        "\n",
        "hoodie = \"\"\"Can’t stop buying plants? Unbeleafable. Don’t worry—us too! Cover yourself in your favourite obsession in our NEW I Love Plants Oodie! \"\"\"\n",
        "\n",
        "soap  = \"\"\"Made with real pine extract, this all-star bar is as tough as a freshly cut bat. \n",
        "A true MVP of the shower, this heavy-hitter knocks out grime with its gritty composition and ultra-manly, woodsy scent.\"\"\"\n",
        "\n",
        "beard = \"\"\" Whatever your style is, Beardbrand Styling Balm is versatile enough to handle it.\n",
        "Designed to work with all hair types, it provides enough hold to keep thick, curly hair under control, \"\"\"\n",
        "gpt2_pipe['gpt2'].setTemperature(1)\n",
        "gpt2_pipe['gpt2'].setDoSample(True)\n",
        "\n"
      ],
      "metadata": {
        "id": "6S1oChzX-7oh",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "ae15d609-0d85-4072-a699-47ee7e26b88a",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "GPT2TRANSFORMER_b38120f8fb6b"
            ]
          },
          "metadata": {},
          "execution_count": 12
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### Marketing Texts for Hoodies"
      ],
      "metadata": {
        "id": "L5tPFUnl_ebF",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Lets generate some marketing texts for our hoodies\n",
        "generate_with_gpt2(gpt2_pipe, [hoodie,hoodie,hoodie,hoodie,hoodie,hoodie])\n"
      ],
      "metadata": {
        "id": "Rtn8lRRK-7qo",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 932
        },
        "outputId": "c2468942-1736-4535-9096-451f14ed892c",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Example 0: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Cant stop buying plants? Unbeleafable. Dont worryus too! Cover yourself in your favourite obsession in our NEW I Love Plants Oodie!\n",
            "\n",
            "Made in UK for the largest selection, this will come in a huge container so you don't burn your hands on it - even if it's just your eyes. Great for parties and parties around the home....this is perfect for parties or when you are looking to get your feet wet and bright...this is an\n",
            "\n",
            "\n",
            "Example 1: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Cant stop buying plants? Unbeleafable. Dont worryus too! Cover yourself in your favourite obsession in our NEW I Love Plants Oodie!\n",
            "\n",
            "\n",
            "Example 2: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Cant stop buying plants? Unbeleafable. Dont worryus too! Cover yourself in your favourite obsession in our NEW I Love Plants Oodie!\n",
            "\n",
            "As for us, we already have a full colour collection of I Love the Plants pages on Amazon for just £9.99... so keep checking us out - we'd better get into it, we've got a whole bunch of lovely new stuff...\n",
            "\n",
            "We just finished our cover project for The Guardian on the\n",
            "\n",
            "\n",
            "Example 3: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Cant stop buying plants? Unbeleafable. Dont worryus too! Cover yourself in your favourite obsession in our NEW I Love Plants Oodie! The covers can be made in black or white and will also be packaged to fit one a mini dollop. You get 20ml of milk chocolate! This one won't disappoint. It's like an all natural cupcake with no flour. What more can a person ask for?\n",
            "\n",
            "Also I love the color combinations,\n",
            "\n",
            "\n",
            "Example 4: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Cant stop buying plants? Unbeleafable. Dont worryus too! Cover yourself in your favourite obsession in our NEW I Love Plants Oodie!\n",
            "\n",
            "it is so cute and simple to make it!!\n",
            "\n",
            "Make sure to save all your favourite faves on Pinterest:\n",
            "\n",
            "You can also share it with your family:\n",
            "\n",
            "\n",
            "Example 5: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Cant stop buying plants? Unbeleafable. Dont worryus too! Cover yourself in your favourite obsession in our NEW I Love Plants Oodie!\n",
            "\n",
            "This is my favourite way to put all the amazing stuff we LOVE into one. You can go for an intricate, detailed looking design (no plastic or wood pieces, anything you can think of) or start creating your own. I always make this with the idea of making a garden or two to take up space which\n",
            "\n",
            "\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "                                            document  \\\n",
              "0  Can’t stop buying plants? Unbeleafable. Don’t ...   \n",
              "0  Can’t stop buying plants? Unbeleafable. Don’t ...   \n",
              "0  Can’t stop buying plants? Unbeleafable. Don’t ...   \n",
              "0  Can’t stop buying plants? Unbeleafable. Don’t ...   \n",
              "0  Can’t stop buying plants? Unbeleafable. Don’t ...   \n",
              "0  Can’t stop buying plants? Unbeleafable. Don’t ...   \n",
              "\n",
              "                                           generated  unique_word_score  \n",
              "0   Cant stop buying plants? Unbeleafable. Dont w...          0.8157895  \n",
              "0   Cant stop buying plants? Unbeleafable. Dont w...          0.9545455  \n",
              "0   Cant stop buying plants? Unbeleafable. Dont w...          0.8333333  \n",
              "0   Cant stop buying plants? Unbeleafable. Dont w...          0.9041096  \n",
              "0   Cant stop buying plants? Unbeleafable. Dont w...          0.8913043  \n",
              "0   Cant stop buying plants? Unbeleafable. Dont w...          0.8846154  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-bc1457fb-5d1b-47d6-88dd-ac4ce28758e1\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>document</th>\n",
              "      <th>generated</th>\n",
              "      <th>unique_word_score</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Can’t stop buying plants? Unbeleafable. Don’t ...</td>\n",
              "      <td>Cant stop buying plants? Unbeleafable. Dont w...</td>\n",
              "      <td>0.8157895</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Can’t stop buying plants? Unbeleafable. Don’t ...</td>\n",
              "      <td>Cant stop buying plants? Unbeleafable. Dont w...</td>\n",
              "      <td>0.9545455</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Can’t stop buying plants? Unbeleafable. Don’t ...</td>\n",
              "      <td>Cant stop buying plants? Unbeleafable. Dont w...</td>\n",
              "      <td>0.8333333</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Can’t stop buying plants? Unbeleafable. Don’t ...</td>\n",
              "      <td>Cant stop buying plants? Unbeleafable. Dont w...</td>\n",
              "      <td>0.9041096</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Can’t stop buying plants? Unbeleafable. Don’t ...</td>\n",
              "      <td>Cant stop buying plants? Unbeleafable. Dont w...</td>\n",
              "      <td>0.8913043</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Can’t stop buying plants? Unbeleafable. Don’t ...</td>\n",
              "      <td>Cant stop buying plants? Unbeleafable. Dont w...</td>\n",
              "      <td>0.8846154</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-bc1457fb-5d1b-47d6-88dd-ac4ce28758e1')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "        \n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "      \n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-bc1457fb-5d1b-47d6-88dd-ac4ce28758e1 button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-bc1457fb-5d1b-47d6-88dd-ac4ce28758e1');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 12
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### Marketing Texts for Soap"
      ],
      "metadata": {
        "id": "JAAcMbgs_hG8",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Lets generate some marketing texts for our soap\n",
        "generate_with_gpt2(gpt2_pipe, [soap,soap,soap,soap,soap,soap])\n"
      ],
      "metadata": {
        "id": "gdYVI_T6-7s8",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 1000
        },
        "outputId": "921f09ea-ee53-459e-dded-9567f7d59c66",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Example 0: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Made with real pine extract, this all-star bar is as tough as a freshly cut bat. A true MVP of the shower, this heavy-hitter knocks out grime with its gritty composition and ultra-manly, woodsy scent.\n",
            "\n",
            "The Original Bar\n",
            "\n",
            "Ingredients: Water, Cinnamon Starch, Natural Balance\n",
            "\n",
            "Dimensions: 6\n",
            "\n",
            "Calories: 39 grams\n",
            "\n",
            "Fat: 10 grams\n",
            " of carbs per serving: 4.8 grams of protein\n",
            "\n",
            "How\n",
            "\n",
            "\n",
            "Example 1: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Made with real pine extract, this all-star bar is as tough as a freshly cut bat. A true MVP of the shower, this heavy-hitter knocks out grime with its gritty composition and ultra-manly, woodsy scent.\n",
            "\n",
            "5. Lilliput Follis\n",
            "\n",
            "Lilliputt follis is an incredibly versatile brew that can be made into a lot of your favorite drinks. The addition of sugar, the addition of lactose and even flotation\n",
            "\n",
            "\n",
            "Example 2: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Made with real pine extract, this all-star bar is as tough as a freshly cut bat. A true MVP of the shower, this heavy-hitter knocks out grime with its gritty composition and ultra-manly, woodsy scent. The bottle does a fine job of keeping me lit, but the glass also provides a nice balance of smell, texture and color. It's an easy way to impress a girlfriend, but one with a few glasses.\n",
            "\n",
            "As for the other two:\n",
            "\n",
            "\n",
            "Example 3: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Made with real pine extract, this all-star bar is as tough as a freshly cut bat. A true MVP of the shower, this heavy-hitter knocks out grime with its gritty composition and ultra-manly, woodsy scent.\n",
            "\n",
            "It also holds up to the shower after it settles down on your skin for hours.\n",
            " —Ralph Cappuccio\n",
            "\n",
            "\n",
            "Example 4: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Made with real pine extract, this all-star bar is as tough as a freshly cut bat. A true MVP of the shower, this heavy-hitter knocks out grime with its gritty composition and ultra-manly, woodsy scent.\n",
            "\n",
            "\n",
            "Made with real birch extract, citrusy vanilla leaves combine to create the ultimate summer cocktail.\n",
            "\n",
            ". This premium bar, with its rich texture and creamy scent, is a true bargain.\n",
            "\n",
            " and other fabulous brands.\n",
            "\n",
            "Click here to\n",
            "\n",
            "\n",
            "Example 5: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Made with real pine extract, this all-star bar is as tough as a freshly cut bat. A true MVP of the shower, this heavy-hitter knocks out grime with its gritty composition and ultra-manly, woodsy scent.\n",
            "\n",
            "\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "                                            document  \\\n",
              "0  Made with real pine extract, this all-star bar...   \n",
              "0  Made with real pine extract, this all-star bar...   \n",
              "0  Made with real pine extract, this all-star bar...   \n",
              "0  Made with real pine extract, this all-star bar...   \n",
              "0  Made with real pine extract, this all-star bar...   \n",
              "0  Made with real pine extract, this all-star bar...   \n",
              "\n",
              "                                           generated  unique_word_score  \n",
              "0   Made with real pine extract, this all-star ba...          0.9107143  \n",
              "0   Made with real pine extract, this all-star ba...          0.8333333  \n",
              "0   Made with real pine extract, this all-star ba...          0.8181818  \n",
              "0   Made with real pine extract, this all-star ba...          0.9245283  \n",
              "0   Made with real pine extract, this all-star ba...          0.7857143  \n",
              "0   Made with real pine extract, this all-star ba...          0.9166667  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-7b76b8ab-f988-4b60-90ef-7cc091ec8948\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>document</th>\n",
              "      <th>generated</th>\n",
              "      <th>unique_word_score</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Made with real pine extract, this all-star bar...</td>\n",
              "      <td>Made with real pine extract, this all-star ba...</td>\n",
              "      <td>0.9107143</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Made with real pine extract, this all-star bar...</td>\n",
              "      <td>Made with real pine extract, this all-star ba...</td>\n",
              "      <td>0.8333333</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Made with real pine extract, this all-star bar...</td>\n",
              "      <td>Made with real pine extract, this all-star ba...</td>\n",
              "      <td>0.8181818</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Made with real pine extract, this all-star bar...</td>\n",
              "      <td>Made with real pine extract, this all-star ba...</td>\n",
              "      <td>0.9245283</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Made with real pine extract, this all-star bar...</td>\n",
              "      <td>Made with real pine extract, this all-star ba...</td>\n",
              "      <td>0.7857143</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Made with real pine extract, this all-star bar...</td>\n",
              "      <td>Made with real pine extract, this all-star ba...</td>\n",
              "      <td>0.9166667</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-7b76b8ab-f988-4b60-90ef-7cc091ec8948')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "        \n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "      \n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-7b76b8ab-f988-4b60-90ef-7cc091ec8948 button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-7b76b8ab-f988-4b60-90ef-7cc091ec8948');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 13
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### Marketing Texts for beard products"
      ],
      "metadata": {
        "id": "BYr6Fr69_jv-",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Lets generate some marketing texts for our beard products\n",
        "generate_with_gpt2(gpt2_pipe, [beard,beard,beard,beard,beard,beard])\n"
      ],
      "metadata": {
        "id": "vhGLPx2N_iUX",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 932
        },
        "outputId": "3f1efda4-52e4-4e1d-efa1-c171035146ef",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Example 0: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Whatever your style is, Beardbrand Styling Balm is versatile enough to handle it. Designed to work with all hair types, it provides enough hold to keep thick, curly hair under control, without compromising the beauty and stability of your natural hair.\n",
            "\n",
            "Flexible and durable, Beard Brand Styling products will stay up through the day for long-lasting results. So why is BeardBrand Styling so important?\n",
            "\n",
            "With Beard Brand, our aim is to create unique and highly-custom\n",
            "\n",
            "\n",
            "Example 1: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Whatever your style is, Beardbrand Styling Balm is versatile enough to handle it. Designed to work with all hair types, it provides enough hold to keep thick, curly hair under control, and keep you in shape, while cutting, mucking and shaping smoothly. The blend of berry leaves, sage and parsley gives this Balm a unique look unlike any other blend in store. As well as the excellent natural and organic ingredients, The Perfect One comes in a beautiful rainbow of reds,\n",
            "\n",
            "\n",
            "Example 2: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Whatever your style is, Beardbrand Styling Balm is versatile enough to handle it. Designed to work with all hair types, it provides enough hold to keep thick, curly hair under control, but not overly harsh strands.\n",
            "\n",
            "You can always customize your own personal style with each application of Beard Brand Styling Masks. A new addition, we've teamed up with the company to bring those original facial styling masks from Beard Brand to this collection.\n",
            "- Make your own Beard Brand Masks in\n",
            "\n",
            "\n",
            "Example 3: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Whatever your style is, Beardbrand Styling Balm is versatile enough to handle it. Designed to work with all hair types, it provides enough hold to keep thick, curly hair under control, and can also give you a nice line without falling over.\n",
            "\n",
            "This shampoo is compatible with every type of hair type. Whether you are looking for a very low and flexible shampoo, to break the gel down for an even flow, to a very flexible style, or simply for hair in general, we thought\n",
            "\n",
            "\n",
            "Example 4: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Whatever your style is, Beardbrand Styling Balm is versatile enough to handle it. Designed to work with all hair types, it provides enough hold to keep thick, curly hair under control, helping you stay clean and balanced.\n",
            "\n",
            "4. Facial Design\n",
            "\n",
            "Face Design is the name given for the ability to enhance your appearance with style, to make your appearance look more unique. Whether you're creating a new look for a romantic occasion or getting a stylish turtleneck, Facial is\n",
            "\n",
            "\n",
            "Example 5: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Whatever your style is, Beardbrand Styling Balm is versatile enough to handle it. Designed to work with all hair types, it provides enough hold to keep thick, curly hair under control, and it's easy peasy enough to apply, too.\n",
            "\n",
            "\n",
            "BeardBrand Styling Matte is a perfect addition to your wardrobe. Designed with the help of your hair stylist, this simple, lightweight product will stand up to the hard world of styling without any maintenance. Beard Brand Matte will make it\n",
            "\n",
            "\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "                                            document  \\\n",
              "0  Whatever your style is, Beardbrand Styling Bal...   \n",
              "0  Whatever your style is, Beardbrand Styling Bal...   \n",
              "0  Whatever your style is, Beardbrand Styling Bal...   \n",
              "0  Whatever your style is, Beardbrand Styling Bal...   \n",
              "0  Whatever your style is, Beardbrand Styling Bal...   \n",
              "0  Whatever your style is, Beardbrand Styling Bal...   \n",
              "\n",
              "                                           generated  unique_word_score  \n",
              "0   Whatever your style is, Beardbrand Styling Ba...          0.8108108  \n",
              "0   Whatever your style is, Beardbrand Styling Ba...          0.8170732  \n",
              "0   Whatever your style is, Beardbrand Styling Ba...          0.7901235  \n",
              "0   Whatever your style is, Beardbrand Styling Ba...          0.7882353  \n",
              "0   Whatever your style is, Beardbrand Styling Ba...          0.7692308  \n",
              "0   Whatever your style is, Beardbrand Styling Ba...          0.7468354  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-71c3e7cc-6d49-4b90-87ad-76826b7cb6e8\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>document</th>\n",
              "      <th>generated</th>\n",
              "      <th>unique_word_score</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Whatever your style is, Beardbrand Styling Bal...</td>\n",
              "      <td>Whatever your style is, Beardbrand Styling Ba...</td>\n",
              "      <td>0.8108108</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Whatever your style is, Beardbrand Styling Bal...</td>\n",
              "      <td>Whatever your style is, Beardbrand Styling Ba...</td>\n",
              "      <td>0.8170732</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Whatever your style is, Beardbrand Styling Bal...</td>\n",
              "      <td>Whatever your style is, Beardbrand Styling Ba...</td>\n",
              "      <td>0.7901235</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Whatever your style is, Beardbrand Styling Bal...</td>\n",
              "      <td>Whatever your style is, Beardbrand Styling Ba...</td>\n",
              "      <td>0.7882353</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Whatever your style is, Beardbrand Styling Bal...</td>\n",
              "      <td>Whatever your style is, Beardbrand Styling Ba...</td>\n",
              "      <td>0.7692308</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Whatever your style is, Beardbrand Styling Bal...</td>\n",
              "      <td>Whatever your style is, Beardbrand Styling Ba...</td>\n",
              "      <td>0.7468354</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-71c3e7cc-6d49-4b90-87ad-76826b7cb6e8')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "        \n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "      \n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-71c3e7cc-6d49-4b90-87ad-76826b7cb6e8 button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-71c3e7cc-6d49-4b90-87ad-76826b7cb6e8');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 14
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Generate Product Reccomendations\n",
        "\n",
        "We can start a base list of top 10 anything and GPT2 will complete it for us!      \n",
        "This is very useful to reccomendation and search engines.\n",
        "\n",
        "![image](https://nosto.com/wp-content/uploads/beerhawk-2-1-1024x709.png)"
      ],
      "metadata": {
        "id": "1vN8kv12_E2j",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Top 10 Movie List"
      ],
      "metadata": {
        "id": "bXlZu5ao_HUq",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "movie_list = \"\"\"My top 10 movie list : \n",
        "1. The Matrix 1\n",
        "2. The Terminator 1\n",
        "3. Scarface\n",
        "4.\"\"\"\n",
        "\n",
        "gpt2_pipe['gpt2'].setTemperature(1)\n",
        "gpt2_pipe['gpt2'].setMaxOutputLength(200)\n",
        "gpt2_pipe['gpt2'].setDoSample(True)\n",
        "generate_with_gpt2(gpt2_pipe, movie_list)\n",
        "\n"
      ],
      "metadata": {
        "id": "tjF2bkef-7dg",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 170
        },
        "outputId": "3c68b897-3a40-4ba9-8fd1-fdc400fd7ea0",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Example 0: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " My top 10 movie list : 1. The Matrix 1 2. The Terminator 1 3. Scarface 4. The Hunger Games 4. Inside Amy Schumer 1 5. The Wolf of Wall Street 5. Red Dead Redemption 5. Empire 1 6. The Avengers 1 7. The Purge - Bonus Episode 1 (Featuring Chris Roberts ) 8. American Horror Story: Hotel (Original Soundtrack) 1/10: The Purging of Mr. Holmes - Soundtrack 2/10 6. A Better Tomorrow 3/10 7. House of Cards 4/10 8. The Blacklist 5/10 9. I Am Legend 6/10 10. The Walking Dead 2/50 11. Black Mirror 2/100 12. Big Love 2/150 13. The Hobbit 4/100 14. The LEGO Movie - A Song of Ice and Fire (Original Music Version) 17 * * The Grand Budapest Hotel 4. Snow White & the Seven Dwarfs 5. Elle: The Return of Eleanor 5.\n",
            "\n",
            "\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "                                            document  \\\n",
              "0  My top 10 movie list : 1. The Matrix 1 2. The ...   \n",
              "\n",
              "                                           generated  unique_word_score  \n",
              "0   My top 10 movie list : 1. The Matrix 1 2. The...          0.7375887  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-bb29620b-de41-4d4b-bb75-aa1ddfeb7e77\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>document</th>\n",
              "      <th>generated</th>\n",
              "      <th>unique_word_score</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>My top 10 movie list : 1. The Matrix 1 2. The ...</td>\n",
              "      <td>My top 10 movie list : 1. The Matrix 1 2. The...</td>\n",
              "      <td>0.7375887</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-bb29620b-de41-4d4b-bb75-aa1ddfeb7e77')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "        \n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "      \n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-bb29620b-de41-4d4b-bb75-aa1ddfeb7e77 button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-bb29620b-de41-4d4b-bb75-aa1ddfeb7e77');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 15
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "How **not** to generate text with GPT2\n",
        "\n",
        "Since every new word is conditioned on every previous **character**, things like **white spaces** and **new lines** can skew the sampling distribution into an unintended direction"
      ],
      "metadata": {
        "id": "9VZ2RmKd_K0o",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# NOTE! We have malformed the input text and added a newline. \n",
        "# This will Confuse GPT2, it might think the list is over and probability distribution changes accordingly\n",
        "movie_list = \"\"\"My Top 10 movie list : \n",
        "1. The Matrix 1\n",
        "2. The Terminator 1\n",
        "3. Scarface\n",
        "4.\n",
        "\n",
        "\n",
        "\"\"\"\n",
        "gpt2_pipe['gpt2'].setTemperature(1)\n",
        "gpt2_pipe['gpt2'].setDoSample(True)\n",
        "generate_with_gpt2(gpt2_pipe, movie_list)\n",
        "\n"
      ],
      "metadata": {
        "id": "O2CruoBU-7fo",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 324
        },
        "outputId": "6bc305a9-2e4d-4d32-9193-9b09f4e196be",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Example 0: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " My Top 10 movie list : 1. The Matrix 1 2. The Terminator 1 3. Scarface 4. X-Men 4 5. Batman V Superman 5 6. The Flash 7 8. Suicide Squad 9 10. Pirates of the Caribbean\n",
            "\n",
            "The list consists of movies that may be part of a trilogy of movies or a single movie. If no trilogy includes a Batman movie, then the only film listed will be The Matrix. Below is the list of films that are made up of several movies.\n",
            "\n",
            "1) Thor: The Dark World: An Academy Award Nomination\n",
            "\n",
            "2) Supernatural 1\n",
            "\n",
            "3) The Hunger Games: Catching Fire\n",
            "\n",
            "4) The Lion King 2\n",
            "\n",
            "\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "                                            document  \\\n",
              "0  My Top 10 movie list : 1. The Matrix 1 2. The ...   \n",
              "\n",
              "                                           generated  unique_word_score  \n",
              "0   My Top 10 movie list : 1. The Matrix 1 2. The...          0.7766990  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-7299c78f-766b-4175-823c-0e2ea0683553\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>document</th>\n",
              "      <th>generated</th>\n",
              "      <th>unique_word_score</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>My Top 10 movie list : 1. The Matrix 1 2. The ...</td>\n",
              "      <td>My Top 10 movie list : 1. The Matrix 1 2. The...</td>\n",
              "      <td>0.7766990</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-7299c78f-766b-4175-823c-0e2ea0683553')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "        \n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "      \n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-7299c78f-766b-4175-823c-0e2ea0683553 button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-7299c78f-766b-4175-823c-0e2ea0683553');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 16
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# This is how it should be done\n",
        "movie_list = \"\"\"My Top 10 worst movie list : \n",
        "1. The Matrix 4\n",
        "2. Jack and Jill\n",
        "3. Super Mario Bros. (1993)\n",
        "4. \"\"\"\n",
        "gpt2_pipe['gpt2'].setTemperature(1)\n",
        "gpt2_pipe['gpt2'].setDoSample(True)\n",
        "generate_with_gpt2(gpt2_pipe, movie_list)\n",
        "\n"
      ],
      "metadata": {
        "id": "Ey1izwQl-7h9",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 411
        },
        "outputId": "441057a2-34e0-48bc-f771-f8d1442afa4c",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Example 0: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " My Top 10 worst movie list : 1. The Matrix 4 2. Jack and Jill 3. Super Mario Bros. (1993) 4. Frozen 5. The Hobbit (2001) 6. The Hunger Games (2000) 7. Harry Potter and the Half-Blood Prince 8. Inception 9. Frozen (2004)\n",
            "\n",
            "Awards\n",
            "\n",
            "In addition to being nominated for an Academy Award, all 20 movies in the list received Oscar nominations from members of the media. The list also includes the Academy Awards, Screen Actors Guild Award nominations, the Academy Award for Best Picture Awards, the Screen Actoring Awards, and The Hollywood Reporter's Best Picture nomination.\n",
            "\n",
            "Directors: Robert Downey Jr., Colin Trevorrow, Michael Green, Charles Roven, Robert J. Abrams\n",
            "\n",
            "Writer: Mark Wahlberg\n",
            "\n",
            "Director: Andrew Stanton\n",
            "\n",
            "Starring: Anna Paquin, Benicio Del Toro, Daniel Day-Lewis\n",
            "\n",
            "Release Date: October 17, 2016\n",
            "\n",
            "\n",
            "\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "                                            document  \\\n",
              "0  My Top 10 worst movie list : 1. The Matrix 4 2...   \n",
              "\n",
              "                                           generated  unique_word_score  \n",
              "0   My Top 10 worst movie list : 1. The Matrix 4 ...          0.8048780  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-c5edd227-1a0a-493f-aa17-04fd282f64f4\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>document</th>\n",
              "      <th>generated</th>\n",
              "      <th>unique_word_score</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>My Top 10 worst movie list : 1. The Matrix 4 2...</td>\n",
              "      <td>My Top 10 worst movie list : 1. The Matrix 4 ...</td>\n",
              "      <td>0.8048780</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-c5edd227-1a0a-493f-aa17-04fd282f64f4')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "        \n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "      \n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-c5edd227-1a0a-493f-aa17-04fd282f64f4 button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-c5edd227-1a0a-493f-aa17-04fd282f64f4');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 17
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Top 10 Game List"
      ],
      "metadata": {
        "id": "NW8etz72_Xvw",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "game_list = \"\"\"My Top 10 video games of all time list : \n",
        "1. Half-Life 2\n",
        "2. Super Smash Bros Melee\n",
        "3. Portal\n",
        "4. \"\"\"\n",
        "# Enable sampling at temperature 1 so the continuation varies from run to run\n",
        "gpt2_pipe['gpt2'].setTemperature(1)\n",
        "gpt2_pipe['gpt2'].setDoSample(True)\n",
        "generate_with_gpt2(gpt2_pipe, game_list)"
      ],
      "metadata": {
        "id": "vz6LjlFl-7kT",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 307
        },
        "outputId": "781024fe-9ce3-40ff-dd17-e46ac5e64aee",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Example 0: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " My Top 10 video games of all time list : 1. Half-Life 2 2. Super Smash Bros Melee 3. Portal 4. Super Mario 3D World 5. Fire Emblem: Awakening 6. Pokemon Generations and the other 2:\n",
            "\n",
            "1. Fire & Dark 2 2 3. Pokemon XD: Gale of Darkness 3 4. Dragon Quest XI Fire Emblem Heroes 6. Super Stardust Crusaders 8. Far Cry Primal 9. Deus Ex: Mankind Divided 10. Super Street Fighter IV - Best Street Fighter VI - Capcom Vs. SNK 2011-10-03 01:05:39\n",
            "\n",
            "\n",
            "Bizarre Creations - Video Game Scoreboards for Nintendo 3DS\n",
            "\n",
            "Click Here Now\n",
            "\n",
            "Play in this site for your Nintendo 3D TV, TV, or portable device to see the best gameplay videos from all these games. Find video games that have the highest scoreboard scores, top scoreboard entries, and the best video from the Super Nintendo Super Nintendo 5 game series in this list\n",
            "\n",
            "\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "                                            document  \\\n",
              "0  My Top 10 video games of all time list : 1. Ha...   \n",
              "\n",
              "                                           generated  unique_word_score  \n",
              "0   My Top 10 video games of all time list : 1. H...          0.7310345  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-493d5d16-4277-4846-9248-5a4ef0fc5ad4\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>document</th>\n",
              "      <th>generated</th>\n",
              "      <th>unique_word_score</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>My Top 10 video games of all time list : 1. Ha...</td>\n",
              "      <td>My Top 10 video games of all time list : 1. H...</td>\n",
              "      <td>0.7310345</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-493d5d16-4277-4846-9248-5a4ef0fc5ad4')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "        \n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "      \n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-493d5d16-4277-4846-9248-5a4ef0fc5ad4 button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-493d5d16-4277-4846-9248-5a4ef0fc5ad4');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 18
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Top 10 Book List"
      ],
      "metadata": {
        "id": "wP_Bt14I_Z7F",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
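    {
      "cell_type": "code",
      "source": [
        "# Use a dedicated variable for the book prompt instead of reusing game_list\n",
        "book_list = \"\"\"My Top 10 books of all time list : \n",
        "1. 1964\n",
        "2. Moby Dick\n",
        "3. robinson crusoe\n",
        "4. \"\"\"\n",
        "# Enable sampling at temperature 1 so the continuation varies from run to run\n",
        "gpt2_pipe['gpt2'].setTemperature(1)\n",
        "gpt2_pipe['gpt2'].setDoSample(True)\n",
        "generate_with_gpt2(gpt2_pipe, book_list)"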
      ],
      "metadata": {
        "id": "8W4Nh0BP-7ma",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 170
        },
        "outputId": "0d203468-deb1-4cbe-dc14-d83b8a277ea5",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Example 0: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " My Top 10 books of all time list : 1. 1964 2. Moby Dick 3. robinson crusoe 4. Blonde Vapour 5. Kitten by Mary and Bill 6. Love & Sex 7. The Princess Bride 8. The Lord of the Rings 9. Harry Potter 10. I'm Going Home 11. Man Without Bodies . . . 12. Harry Styles 12. Temptations ... 13. The Lion King 14. The Wizard of Oz 15. J.K. Rowling 16. The Long Goodbye 17. Unbelievable Things 18. The Diary of Anne Rice 19. The Art of Misadventure 20. Jumanji for Girls 21. The King to the Rescue 22. The Book of Life 23. A Diary of a Love Story 24. Bollywood 10/11/16 25. The Adventures of Dr. N (or any other book you'd like the people to buy if they like The Dark Knight and all…) 26. A Christmas Carol 27\n",
            "\n",
            "\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "                                            document  \\\n",
              "0  My Top 10 books of all time list : 1. 1964 2. ...   \n",
              "\n",
              "                                           generated  unique_word_score  \n",
              "0   My Top 10 books of all time list : 1. 1964 2....          0.7931034  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-28285492-a79f-4670-a171-64b9321a4682\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>document</th>\n",
              "      <th>generated</th>\n",
              "      <th>unique_word_score</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>My Top 10 books of all time list : 1. 1964 2. ...</td>\n",
              "      <td>My Top 10 books of all time list : 1. 1964 2....</td>\n",
              "      <td>0.7931034</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-28285492-a79f-4670-a171-64b9321a4682')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "        \n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "      \n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-28285492-a79f-4670-a171-64b9321a4682 button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-28285492-a79f-4670-a171-64b9321a4682');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 19
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Song Lyrics\n",
        "![image](https://nationaltoday.com/wp-content/uploads/2021/05/Sing-Out.jpg)"
      ],
      "metadata": {
        "id": "xiEumf9Y_l3N",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### Last Christmas"
      ],
      "metadata": {
        "id": "AYEUiKyS_mvg",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "last_christmas = \"\"\"\n",
        "Last Christmas, I gave you my heart\n",
        "But the very next day, you gave it away\n",
        "This year, to save me from tears\n",
        "I'll give it to someone special\n",
        "\"\"\"\n",
        "\n",
        "gpt2_pipe['gpt2'].setMaxOutputLength(1000)\n",
        "generate_with_gpt2(gpt2_pipe, last_christmas)"
      ],
      "metadata": {
        "id": "UBQzpC8Q-7vL",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 552
        },
        "outputId": "b8d82ced-8c26-4364-ca57-bae2de4264be",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Example 0: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Last Christmas, I gave you my heart But the very next day, you gave it away This year, to save me from tears I'll give it to someone special These are the days that we're together, when we smile, when it isn't sad and when we don't think about what's in the world for days until it's over So if I'm gonna go, I have to go This month that I'm not gonna be afraid For the second time it'll be with you And I'm going to try to take it all.\n",
            "\n",
            "What do you mean that?\n",
            "\n",
            "I try to go through those moments, I try to make them so that I can give them out I'm trying to take them down And it takes me, you know, it's like a gift, you can never forget and I feel so grateful for it.\n",
            ", but, there's a feeling in the air, maybe one I've gotten to a point where I just don't cry, because we really are together forever This was in the most basic of terms a chance I got to give and give someone a moment to hold me to my word.\n",
            ". . .\n",
            "\n",
            "Well that's it?\n",
            " (to Linda) That's what I'm doing.\n",
            "…\n",
            "\n",
            "Are you feeling better? Maybe you have other things going on? Or just are you...I can tell you're feeling better. I'm in a weird place, so I'm making arrangements for myself, it doesn't really make sense, but I'm talking about talking to the guys. I have other thoughts, which is good to hear because I don't have to talk to them.\n",
            " and, it seems to me, I think we just haven't discussed it in an intelligent way. You know, when I say that I feel like a new person too I'll say that it's so amazing or some stuff like that, I'm able to think about and make up things for myself. So yeah, no, that's really it, I haven't been saying anything to you about this and it's just so amazing that I could say that that could be. At the end of the day, just because I just went through that moment where I'm like, oh you want nothing more to do with this, you're here, maybe you want that, maybe it's a chance, and I want to move to better things you do with your life . . . I'm always thinking about myself, you have to be on the guard.\n",
            " in a way, a lot of people's life is built around not talking but rather taking chances to think. I really want to make it clear that I want you to have fun.\n",
            "\"No, I want nothing to do.\"\n",
            "\n",
            "\"I don't want to know how to ask. And I never will.\"\n",
            " [on how the other guys dealt with him] \"You're in a strange place. You're alone. Don't think twice about what you need. You never ask anything different from what it has to be the day that I have. I never do.\" [on getting to the hospital after seeing his first cancer] \"What time was your surgery back? It's always been my job. But that surgery was my mission to do and that's always happened to me. Now I can't wait for it to come over in my face. My new mission. What a wonderful mission. Just a fine job. It was a wonderful, lovely job, and now I've got to do what it takes, and just try to be a better person, and try to get a better life for myself and a better world.\" So, you look at that. Well I do. I would not describe it as not having fun, you could say it means that in a sense it might look like you're just done. I was always just doing things that I didn't want people to do. When you're making decisions like that it can lead to problems. You end up at a place where you're very depressed. You try to think different things and there's really no rational place anymore for that. People are getting hurt, and that is always really damaging for people, because it's always meant to come back. So it's only logical that once again that we could do anything except we're doing it wrong, in a very specific way.\"\n",
            " (the two of them go on a journey to talk before the finale to talk a lot) \"So, I feel good. All I want is someone that I think can change things.\"\n",
            " 1\n",
            "\n",
            "In previous interviews we've used the phrase 'people who may or may not have been involved in this.' As you can see, we are all trying to come up with the best way to sum up the moments between the two of you. If someone has all the answers that we've given, they can come up to you and ask you how they believe,\n",
            "\n",
            "\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "                                            document  \\\n",
              "0  Last Christmas, I gave you my heart But the ve...   \n",
              "\n",
              "                                           generated  unique_word_score  \n",
              "0   Last Christmas, I gave you my heart But the v...          0.4331683  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-d7a03c51-da4c-42cc-adcf-11d46d7fc012\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>document</th>\n",
              "      <th>generated</th>\n",
              "      <th>unique_word_score</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Last Christmas, I gave you my heart But the ve...</td>\n",
              "      <td>Last Christmas, I gave you my heart But the v...</td>\n",
              "      <td>0.4331683</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-d7a03c51-da4c-42cc-adcf-11d46d7fc012')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "        \n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "      \n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-d7a03c51-da4c-42cc-adcf-11d46d7fc012 button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-d7a03c51-da4c-42cc-adcf-11d46d7fc012');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 20
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### Fresh prince of GPT"
      ],
      "metadata": {
        "id": "zzFvBSnH_oVF",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "bel_air = \"\"\"\n",
        "Now, this is a story all about how\n",
        "My life got flipped-turned upside down\n",
        "And I'd like to take a minute\n",
        "Just sit right there\n",
        "I'll tell you how I became the prince of a town called Bel-Air\n",
        "\"\"\"\n",
        "\n",
        "gpt2_pipe['gpt2'].setMaxOutputLength(1000)\n",
        "generate_with_gpt2(gpt2_pipe, bel_air)"
      ],
      "metadata": {
        "id": "_NBq05Cu-7xY",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 292
        },
        "outputId": "44284d2a-3b79-431e-ad52-50c21977376d",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Example 0: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Now, this is a story all about how My life got flipped-turned upside down And I'd like to take a minute Just sit right there I'll tell you how I became the prince of a town called Bel-Air\n",
            "\n",
            "My childhood friend, a musician, named me when I was an infant. I could be anywhere. My parents were great musicians, who didn't care if I was a musician or not. He had a small musical brain and he could play. I was always learning my song. He listened to this music like an ankh. I knew nothing about the different sounds that pop in and out like I played in a band. There was nothing real about him except the melodies that I was seeing. He played his music, listened to it, and became \"great.\" I had never heard the \"Great Teacher!\" song. I came to love him as soon as I heard it, at first. I began to become so good at this music, that I'd get too close, too much to any other person, to get near him that I really didn't want to get close to Him and was afraid his friends would find out. He was one of the very few people in the world who could hear the music, and everyone had at the time, but then, that changed. Suddenly, I was surrounded in a world of all-good musicians like a baby, who was never in pain. Nobody could hear what I was doing at that moment. Once, he suddenly, it was almost impossible to find out what I wanted to be. I'd come to love his music.\n",
            "\n",
            "We moved to Los Angeles to play in a club in the early Eighties. My friends and I had a lot of friends in the band. They were so good friends. And one of them was an old man who had just arrived in the car all hot to the point where he had to put on his sunglasses and put a tie on to the steering wheel, and he had just hit a car. He's a white guy, he's an old guy, with no hair, blue jeans, and a brown jacket. He looks and sounds like his mother. He wasn't the kind of person who was going to help you find out who you were or where you came from. We went to church and tried to figure out what really happened. We talked about the songs, and then my brother would go to an auto service and say the lyrics to that song and my friend would say it. We would find him the following day a little bit, a little after we'd gone to church. We met at McDonald's and we had all kinds of people playing with us. People would say, \"Come on, we'd all love that.\" We'd go for a walk and we would always come back to McDonald's so we could get a seat and take pictures with the whole bunch. That's where We would stop and go to get some drinks and hang out. Even if we went into the street, it didn't matter in the least, because the car would tell us that we were going to have to go through some sort of security by one of three streets and that the car, one of these, would be the end of everybody's life. It was too important to hide. We never had to tell the whole story. We were just friends. Even at about 12, I could remember once when I told my dad, a young man, I'm very good at doing things when I'm really cold, on the outside, like, with all those colds under my nose when I take my phone or my sweater off.\n",
            " [laughs] But that wasn't that. I would come back and find that the whole car was gone. My father saw us go to church in that car, that car was never gone and there were so many cars all about in the park. And of course, once I got back down there and took pictures with those strangers who had gone through so many car stuff, he was just so happy that I would finally go see The Godfather. He would come down the park and meet me and tell me that I got shot. They started having fun with me and talking about the cars and the police and the kids. He didn't really pay attention to that stuff. All these stories, but now he's gone and he is, like his real life is complete now it's out of here and now people are asking me \"What's he doing all by himself?\"\n",
            "\n",
            "[laughs] My dad said to me by the time I got to the hospital I was pretty sure that there was no way him to get me out of there. So, I started to do everything I could to keep him from being released. If people asked me this question I'd say no. I wasn't allowed to sit there and ask what's wrong and why I was not in my home state of California for three weeks,\n",
            "\n",
            "\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "                                            document  \\\n",
              "0  Now, this is a story all about how My life got...   \n",
              "\n",
              "                                           generated  unique_word_score  \n",
              "0   Now, this is a story all about how My life go...          0.4252336  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-88b8693a-874e-4387-bda3-58e4bfd20723\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>document</th>\n",
              "      <th>generated</th>\n",
              "      <th>unique_word_score</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Now, this is a story all about how My life got...</td>\n",
              "      <td>Now, this is a story all about how My life go...</td>\n",
              "      <td>0.4252336</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-88b8693a-874e-4387-bda3-58e4bfd20723')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "        \n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "      \n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-88b8693a-874e-4387-bda3-58e4bfd20723 button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-88b8693a-874e-4387-bda3-58e4bfd20723');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 21
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### GPT Jackson"
      ],
      "metadata": {
        "id": "qJr6PsWD_qcT",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "thriller_lyrics = \"\"\"\n",
        "They told him, \"Don't you ever come around here\"\n",
        "\"Don't wanna see your face, you better disappear\"\n",
        "The fire's in their eyes and their words are really clear\n",
        "So beat it, just beat it\n",
        "You better run, you better do what you can\n",
        "Don't wanna see no blood, don't be a macho man\n",
        "You wanna be tough, better do what you can\n",
        "So beat it, but you wanna be bad\n",
        "Just beat it (beat it), beat it (beat it)\n",
        "No one wants to be defeated\n",
        "Showin' how funky and strong is your fight\n",
        "It doesn't matter who's wrong or right\n",
        "Just beat it (beat it)\n",
        "Just beat it (beat it)\n",
        "\"\"\"\n",
        "\n",
        "\n",
        "gpt2_pipe['gpt2'].setMaxOutputLength(500)\n",
        "generate_with_gpt2(gpt2_pipe, thriller_lyrics)"
      ],
      "metadata": {
        "id": "CSYNeDBO_tAX",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 240
        },
        "outputId": "dc1d05e5-adb3-436b-b54a-1e75e378d7bf",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Example 0: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " They told him, \"Don't you ever come around here\" \"Don't wanna see your face, you better disappear\" The fire's in their eyes and their words are really clear So beat it, just beat it You better run, you better do what you can Don't wanna see no blood, don't be a macho man You wanna be tough, better do what you can So beat it, but you wanna be bad Just beat it (beat it), beat it (beat it) No one wants to be defeated Showin' how funky and strong is your fight It doesn't matter who's wrong or right Just beat it (beat it) Just beat it (beat it)\n",
            "\n",
            "RAW Paste Data\n",
            "\n",
            "You're gonna die If you don't know it, then you're going to die Showin, this isn't real, I'm serious Your fuckin' a man, man You'll die this fuckin' time, you die this time. You got better than bad You gon get me dead, we'll be gonna die together, we're gonna get a girl and we'll do this Showin', this isn' real, like it's a film You're getting more and more bad, I got all day, you can't let I get you away, I'll hit you in the ass with my gun, I won't shoot you (the gun) When you fuck me with it, you gonna kill me Showin,' this isn'. It's this fight All time, we gotta go, we get into it Showin'. It will give me back my life, if you wanna come together, here comes my friends Showin You're a good guy, but a little bit is still better Don't want to do what he's saying, don' wanna do what we tell you to, like, beat him up, just kill him Showin',\" you're gonna know you're not dead, showin'. Nothin' gonna happen to you, just gonna keep fighting Showin\" That's just not true\" I'm the same one as you, they'll make you a different one Showin\", or 'We don't want the same things, we want good ones', showin', or 'Good days are always tough, never make fun of the same people, fight the same battles, get a good fight, be a good soldier, be able to work with both sides.\" (and his two friends don't show this, they laugh at the words\n",
            "\n",
            "\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "                                            document  \\\n",
              "0  They told him, \"Don't you ever come around her...   \n",
              "\n",
              "                                           generated  unique_word_score  \n",
              "0   They told him, \"Don't you ever come around he...          0.5452128  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-865198f6-07d3-4125-a2bc-1cad9a0d7638\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>document</th>\n",
              "      <th>generated</th>\n",
              "      <th>unique_word_score</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>They told him, \"Don't you ever come around her...</td>\n",
              "      <td>They told him, \"Don't you ever come around he...</td>\n",
              "      <td>0.5452128</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-865198f6-07d3-4125-a2bc-1cad9a0d7638')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "        \n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "      \n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-865198f6-07d3-4125-a2bc-1cad9a0d7638 button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-865198f6-07d3-4125-a2bc-1cad9a0d7638');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 22
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### Real Slim GPT"
      ],
      "metadata": {
        "id": "iBGccgBY_qhI",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "real_slim_shady = \"\"\"\n",
        "May I have your attention, please?\n",
        "May I have your attention, please?\n",
        "Will the real Slim Shady please stand up?\n",
        "I repeat, will the real Slim Shady please stand up?\n",
        "We're gonna have a problem here?\n",
        "Y'all act like you never seen a white person before\n",
        "Jaws all on the floor like Pam, like Tommy just burst in the door\n",
        "And started whoopin' her ass worse than before\n",
        "They first were divorced, throwin' her over furniture (ah!)\n",
        "It's the return of the, oh wait, no way, you're kidding\n",
        "He didn't just say what I think he did, did he?\n",
        "And Dr. Dre said-, nothing you idiots\n",
        "Dr. Dre's dead, he's locked in my basement (haha)\n",
        "\"\"\"\n",
        "\n",
        "\n",
        "\n",
        "gpt2_pipe['gpt2'].setMaxOutputLength(1000)\n",
        "generate_with_gpt2(gpt2_pipe, real_slim_shady)"
      ],
      "metadata": {
        "id": "_2uYnoMq_vc2",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 205
        },
        "outputId": "8d0bceb7-e2a2-4724-bbef-10aee37a85ff",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Example 0: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " May I have your attention, please? May I have your attention, please? Will the real Slim Shady please stand up? I repeat, will the real Slim Shady please stand up? We're gonna have a problem here? Y'all act like you never seen a white person before Jaws all on the floor like Pam, like Tommy just burst in the door And started whoopin' her ass worse than before They first were divorced, throwin' her over furniture (ah!) It's the return of the, oh wait, no way, you're kidding He didn't just say what I think he did, did he? And Dr. Dre said-, nothing you idiots Dr. Dre's dead, he's locked in my basement (haha) A dead man who's been outed in this city for over a year. Or, like Mr. Slim, he got this big ass deal and the name Slim Shader was right by him, that he was a thug And this shit was made up when I was 5 This dude is so rich and shit, he thinks the poor nigga is too poor to get by. No, look how fuckin' hard he's got in his little ass, his big ass in me and all him, like his little cock-n-fucker, like he's still fucked And there he is, in my home and his life, a nigga living right down the barrel of a shotgun No, what is this. (Laughter.) There no, what this! He wants to be a thug, you know it's bad I'm gonna get busted for having, like you know, little balls on my ass, you like you ain't a thug? The real Slim has just had his shit together, his house is the real deal I'm pissed he's dead because he bought us a house that he sold to me And now I love his ass, but this is not just the real me, that's it. Why is this guy in my mind, and these, these, all the crap that he's ever made that will be gone, and what did he do to me when I came to get him, or to get me, because he thought I was a real deal, like I'm an asshole? (Smack-up scream) No, that girl is going to come back, you fucking bitch, she's gonna walk up your neck like a fuckin dick! (Laugh.) 
She's going to say, you look at me and shit like in a fucking pit of yours You're gonna, she fucking can't talk like a motherfucker You fucking bitch! (She says it's such a stupid thing to say to a nigger, you really thought this was good when you didn't look like she was going to get pissed off at you.) Oh, what I have to say is, man I had better have your shit together now you niggers, like a good bitch like the one, you real good bitch, like yours real good. If I got you like a house with all these dope, bitchz, weed, and weed weed and weed and I fucked up my house and fucked you up like you shit was getting me off. It's your big ass, I got a huge ass, and I got your ass too. Fuck, you can fuck him. He'd say I look stupid to say this to a person, because I just want to fuck her with his little gun. I'm fucking tired of being a bitch about shit I've never done, because this nigga said he fucking had to give a fuck to me, you fuckin filthy man, you got no sense he could fuck me, to do it like this, for no money at all. I mean, you saw him, and you're here right now, your life's fucked, and somebody hit him and all that shit. So, man, this nigger is all that you know. (Crowd laugh) This is bullshit, I was fucking in an argument with a nice, old friend. That friend thought I should fuck up a house, this is bullshit. I thought I could let him live anywhere he wanted, he was gonna pay for the house, or whatever it was. He was actually, we'd had a real friendship there for some time. I didn't care if he'd fuck me. I just wanted to get down, I wanted it to be fun. He didn?t tell me, and then he said, do you think he could do a deal to be paid off? It's this shit right here, this shit, and it's just that we had this very, very long relationship, and we always had two daughters, it's a long relationship. I never, he said it might be the last time we had a kid together. I was like, fuck shit, you think you can't be happy?\n",
            "\n",
            "I am not a jerk\n",
            "\n",
            "\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "                                            document  \\\n",
              "0  May I have your attention, please? May I have ...   \n",
              "\n",
              "                                           generated  unique_word_score  \n",
              "0   May I have your attention, please? May I have...          0.4375788  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-c9cbb320-6c10-4e0f-bee8-e3ecc9301a78\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>document</th>\n",
              "      <th>generated</th>\n",
              "      <th>unique_word_score</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>May I have your attention, please? May I have ...</td>\n",
              "      <td>May I have your attention, please? May I have...</td>\n",
              "      <td>0.4375788</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-c9cbb320-6c10-4e0f-bee8-e3ecc9301a78')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "        \n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "      \n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-c9cbb320-6c10-4e0f-bee8-e3ecc9301a78 button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-c9cbb320-6c10-4e0f-bee8-e3ecc9301a78');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 23
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### GPT Rap God Bot"
      ],
      "metadata": {
        "id": "NLZhtZzr_yIN",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "rap_god_lyrics = \"\"\"\n",
        "Look, I was gonna go easy on you not to hurt your feelings\n",
        "But I'm only going to get this one chance (six minutes-, six minutes-)\n",
        "Something's wrong, I can feel it (six minutes, Slim Shady, you're on!)\n",
        "Just a feeling I've got, like something's about to happen, but I don't know what\n",
        "If that means what I think it means, we're in trouble, big trouble\n",
        "And if he is as bananas as you say, I'm not taking any chances\n",
        "You are just what the doc ordered\n",
        "I'm beginnin' to feel like a Rap God, Rap God\n",
        "All my people from the front to the back nod, back nod\n",
        "Now, who thinks their arms are long enough to slap box, slap box?\n",
        "They said I rap like a robot, so call me Rap-bot \"\"\"\n",
        "\n",
        "# Set the maximum output length to 1000 so GPT-2 can extend the lyrics with a long continuation\n",
        "gpt2_pipe['gpt2'].setMaxOutputLength(1000)\n",
        "generate_with_gpt2(gpt2_pipe, rap_god_lyrics)"
      ],
      "metadata": {
        "id": "0Vx-4xlT_vfH",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 274
        },
        "outputId": "d9772f8e-5a64-4906-e0e4-4a1742fe5ce3",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Example 0: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Look, I was gonna go easy on you not to hurt your feelings But I'm only going to get this one chance (six minutes-, six minutes-) Something's wrong, I can feel it (six minutes, Slim Shady, you're on!) Just a feeling I've got, like something's about to happen, but I don't know what If that means what I think it means, we're in trouble, big trouble And if he is as bananas as you say, I'm not taking any chances You are just what the doc ordered I'm beginnin' to feel like a Rap God, Rap God All my people from the front to the back nod, back nod Now, who thinks their arms are long enough to slap box, slap box? They said I rap like a robot, so call me Rap-bot, but you're a very different animal?\n",
            "\n",
            "RAW Paste Data\n",
            "\n",
            "Danger in the Water Don't you like my dick, my ass, my cock Is it your cock that you wanna come in the water? Oh, you, it's my cock and I am your body. I'm your little hand on your ass, or your little ass, you wanna fuck me, don't you wanna cum Don't tell me you can't suck his cum don't give me a break Don't believe me tell me and don't tell my father don't think you're strong enough Don't call me like a bitch, but it's all you're capable of in this position I can barely get to your cunt I don... I know (just a little bit!)\n",
            "\n",
            "THE END\n",
            "\n",
            "\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "                                            document  \\\n",
              "0  Look, I was gonna go easy on you not to hurt y...   \n",
              "\n",
              "                                           generated  unique_word_score  \n",
              "0   Look, I was gonna go easy on you not to hurt ...          0.5968379  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-49d855e6-90bb-41f1-ba74-a87f787f317d\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>document</th>\n",
              "      <th>generated</th>\n",
              "      <th>unique_word_score</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Look, I was gonna go easy on you not to hurt y...</td>\n",
              "      <td>Look, I was gonna go easy on you not to hurt ...</td>\n",
              "      <td>0.5968379</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-49d855e6-90bb-41f1-ba74-a87f787f317d')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "        \n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "      \n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-49d855e6-90bb-41f1-ba74-a87f787f317d button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-49d855e6-90bb-41f1-ba74-a87f787f317d');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 24
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Fiction Writing\n",
        "\n",
        "![image](https://botnik.org/content/hp01.jpg)\n",
        "\n",
        "Text generators can help you write, and even sell, [books](https://botnik.org/content/harry-potter.html)."
      ],
      "metadata": {
        "id": "femLar3e_0AF",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### Fantasy Stories"
      ],
      "metadata": {
        "id": "LZBPsthM_1oJ",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "silly_story = \"\"\"\n",
        "Blue creatuers have been found in underground cavern in Maxico, which are mysteriously filled with peanut butter pockets.\n",
        "These pockets are rich in proteins and enabled a civilization of these creatures to flurish and \n",
        "build technologies which have surpassed human understanding. Scientist have studied these\n",
        "creataures and found incredible discoveries, including their mastery of the Spanish language and perfect Tacos!\"\"\"\n",
        "\n",
        "# Set the maximum output length to 1000 and let GPT-2 continue the story from the prompt\n",
        "gpt2_pipe['gpt2'].setMaxOutputLength(1000)\n",
        "generate_with_gpt2(gpt2_pipe, silly_story)"
      ],
      "metadata": {
        "id": "R9HzQAtx_vjt",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 222
        },
        "outputId": "9f5f7075-8cb0-4f55-c227-652c6587e9e0",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Example 0: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " Blue creatuers have been found in underground cavern in Maxico, which are mysteriously filled with peanut butter pockets. These pockets are rich in proteins and enabled a civilization of these creatures to flurish and build technologies which have surpassed human understanding. Scientist have studied these creataures and found incredible discoveries, including their mastery of the Spanish language and perfect Tacos! The ancient secret to their building of utensils, which have been discovered by archaeologists worldwide and scientists living in New York City, Los Angeles, Paris, Switzerland, India, Italy, Hong Kong, China and other places was a secret sauce which we do not fully understand, so no one can believe that these new discoveries will allow us to see how they might be used or even to decipher them. It's interesting how we can make the same discoveries the same way in a single day without knowing how much research I'll be able to do. We can also understand one another! Now, as an aside and also to help you understand what I mean here, we also have this amazing infographic to add, that tells you the true story of the ancient world, so it's easy to understand how this could work out:\n",
            "\n",
            "\n",
            "I hope to share this knowledge in a small way:\n",
            "\n",
            "\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "                                            document  \\\n",
              "0  Blue creatuers have been found in underground ...   \n",
              "\n",
              "                                           generated  unique_word_score  \n",
              "0   Blue creatuers have been found in underground...          0.6981132  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-43f815f6-8b9a-4068-a874-018f93549df7\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>document</th>\n",
              "      <th>generated</th>\n",
              "      <th>unique_word_score</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Blue creatuers have been found in underground ...</td>\n",
              "      <td>Blue creatuers have been found in underground...</td>\n",
              "      <td>0.6981132</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-43f815f6-8b9a-4068-a874-018f93549df7')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "        \n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "      \n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-43f815f6-8b9a-4068-a874-018f93549df7 button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-43f815f6-8b9a-4068-a874-018f93549df7');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 26
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### Lovecraftian Horror Stories"
      ],
      "metadata": {
        "id": "WNRKnBa8_5WP",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "cthulu_intro = \"\"\"\n",
        "The most merciful thing in the world, I think, is the inability of the human mind to correlate all its contents. \n",
        "We live on a placid island of ignorance in the midst of black seas of infinity, and it was not meant that we should voyage far. \n",
        "The sciences, each straining in its own direction, have hitherto harmed us little\"\"\"\n",
        "\n",
        "# Set the maximum output length to 1000 so GPT-2 can continue the opening passage at length\n",
        "gpt2_pipe['gpt2'].setMaxOutputLength(1000)\n",
        "generate_with_gpt2(gpt2_pipe, cthulu_intro)"
      ],
      "metadata": {
        "id": "RqeOkWiS_vn5",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 344
        },
        "outputId": "97d6c224-a42e-4270-845d-8f338eb4251f",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Example 0: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " The most merciful thing in the world, I think, is the inability of the human mind to correlate all its contents. We live on a placid island of ignorance in the midst of black seas of infinity, and it was not meant that we should voyage far. The sciences, each straining in its own direction, have hitherto harmed us little by little; but now it is so much greater a threat to our very lives that we ought not merely to resist it, but to give up to it what we have. The only real hope is a world free from the superstitions, illusions and illusions of the past in which our humanity has been made, and inhabited by the children of the people who lived around us, in accordance with the same moralism and superstition, as we live today; and of these we know the world will never become so. All these things are not quite as it is, when we have learned enough to understand that we cannot imagine the world without it and to become conscious of the fact that there must be such a mind in all its manifestations in every sphere of life.\n",
            "\n",
            "Of course, my remarks to you are much longer, and I will not try to answer your questions again because they are, I assure you, somewhat long. But for many years I have been in a very happy life and now want to do a more detailed analysis, and of course I believe you will have a better view than I have. But the last question I want then to say now is: if I are obliged to write that I think you shall never be the true doctor of all this nonsense, I will say to you that as much of it is pure nonsense as you believe yourself to be, and even if you will admit it, my purpose does not have any basis except to say that no man ever would have a life in this sort, but I shall endeavour to prove it. Therefore I say: I shall be your doctor of a mind which is entirely ignorant and impotent and which can even have no conception of life or anything. If you have any sense that this mind is to come into being, you will be the first to know that your treatment will not be satisfactory, and no one is to have more than this mind, for even the doctor who is to be your master will be to be quite ignorant. The Doctor must have other ideas, and that only takes his care of himself. A certain idea will not suffice to make a man your master and that does not make him your doctor, and this may be my point: but if we will not do the Doctors' work, what will in fact be the Doctor's idea? I do not even see the idea of a Doctor without his doctor; or whether it can be expressed as one without the other. Nor can I say more in a short answer than you will understand by listening to something which you have already heard clearly, on reading what I think on my own subject. My point is that it appears that our whole life is so miserable that even the least thing might be regarded as a deathbed for you. 
In spite of all your efforts at medical education, your treatment and every other thing, there is only a certain idea of the world to which that which is to live is to go in such a direction, and to a certain point. You will never be your Master if you do not put an end to this idea. So you will feel that your doctor has the same idea and do anything to protect himself from this possibility but should for a time attempt to do what the Doctor says will not go in this manner: or, if you think it will, you shall be treated, not so much as to take away that idea. But I think that if our whole lives are so miserable, and therefore not so good, I wish I could put an ends to this nonsense.\n",
            ".\n",
            "\n",
            "\n",
            "\n",
            "CHAPTER xII.\n",
            "- There may be not in our own country no more good for the world than there have been, and they too could not exist in this world as we are. .\n",
            "\n",
            "- Let it not come to pass, that we may not take up all those hopes which may have been so many years ago, that no one may expect a future from us in this kind, we may but do so, even to ourselves. For we have a real intention. The good doctor may be thought of more as a philosopher. The foolish and absurd would, more or less, be thought less of a fool, and will look down upon those who give heed to the vain. Let such a man not be considered as an evil-doer; but as a doctor, a doctor's friend, a physician's neighbour-in-law, and a physician-instraint. We may be all the more sure that he is of a good work, that the doctor will not have to keep looking to a sick man's body to discover the reason why he\n",
            "\n",
            "\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "                                            document  \\\n",
              "0  The most merciful thing in the world, I think,...   \n",
              "\n",
              "                                           generated  unique_word_score  \n",
              "0   The most merciful thing in the world, I think...          0.3911565  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-38af4d42-c1e3-49fb-9f1f-10eebd718d74\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>document</th>\n",
              "      <th>generated</th>\n",
              "      <th>unique_word_score</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>The most merciful thing in the world, I think,...</td>\n",
              "      <td>The most merciful thing in the world, I think...</td>\n",
              "      <td>0.3911565</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-38af4d42-c1e3-49fb-9f1f-10eebd718d74')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "        \n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "      \n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-38af4d42-c1e3-49fb-9f1f-10eebd718d74 button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-38af4d42-c1e3-49fb-9f1f-10eebd718d74');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 28
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### Re-Write movie scripts"
      ],
      "metadata": {
        "id": "O98T1TL3_7pt",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "the_matrix_script = \"\"\"\n",
        "The Architect: Hello, Neo.\n",
        "\n",
        "Neo: Who are you?\n",
        "\n",
        "The Architect: I am the Architect. I created the matrix. I've been waiting for you. \n",
        "You have many questions, and although the process has altered your consciousness, you remain irrevocably human. \n",
        "Ergo, some of my answers you will understand, and some of them you will not. \n",
        "Concordantly, while your first question may be the most pertinent, you may or may not realize it is also irrelevant.\n",
        "\n",
        "Neo: Why am I here?\n",
        "\n",
        "The Architect: \"\"\"\n",
        "\n",
        "gpt2_pipe['gpt2'].setMaxOutputLength(1000)\n",
        "generate_with_gpt2(gpt2_pipe, the_matrix_script)"
      ],
      "metadata": {
        "id": "049s4qC1_vqY",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 309
        },
        "outputId": "c9f57cf4-3b88-44ad-dcaa-d8863ab58c55",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Example 0: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " The Architect: Hello, Neo. Neo: Who are you? The Architect: I am the Architect. I created the matrix. I've been waiting for you. You have many questions, and although the process has altered your consciousness, you remain irrevocably human. Ergo, some of my answers you will understand, and some of them you will not. Concordantly, while your first question may be the most pertinent, you may or may not realize it is also irrelevant. Neo: Why am I here? The Architect: To let you go. You will need to prove to yourself that there is a way out, as I can give you a solution. That was what I was looking for, and now you have the answer. But where is my life left to go? Neo: There is only one choice. The Architect. Neo's name is Giro: The Architect? The Architecture: It is called the Matrix. And just so you understand, there is nothing beyond that. Where does my life ever end? I am a piece of shapeless stone. Neo wants to stop thinking. He wants to go into a new life in his own body through the Architect who was created only a day ago. He is an object who controls how you change. He will only allow you to choose yourself through the Process... The Architect and the Matrix remain as one. The only difference is that the former remains in control of how your consciousness is created so that you can continue to do what you do all day long... When you die, these are the two pieces of the Matrix which have been created. The Matrix is your consciousness. The two are connected, the left and the right. The left controls how well your body functions and the ability of your brain to process events. But the right controls you. The end is you.\n",
            "\n",
            "A timeline is a timeline set by the Architect or his creators that ends in a single point. It is the first of three final events to occur at a specific point within our lives. This timeline acts independently of the other, allowing us to follow the Architect and his creations wherever they are needed, but we are only able to know from our current point in time, without our knowledge of the Architect's designs.\n",
            ", which, in turn, gives us all our powers of perception of the future that we might possess. The human form of the timeline is the mind of one Architect, or at the very least, it is our sense of who we are, not how we are created. It gives us much of the power to act independently of one another and to guide us if need be to change the world (at our will).\n",
            "\n",
            "This, of course, is where the power of thought comes in. The mind of either one of the three Architects will allow us to perceive the world or to create it.\n",
            " - I must ask, what is the point of all this, then? If you think it is possible that something is created simply by chance, then you are not alone within the Matrix, but also within your own life as well. What does our personal story tell us about the future? If this person is not simply out of your control, then who was it who created us? The truth is, our fate is determined not by our actions, as we choose, but by who we believe to be our future selves. You don't need to be in a complete control of the subject you're telling me, rather, your decisions give the final say over what your decisions take.\n",
            " - As the Architect says in his first lecture, if you choose to do something you don't want to do, then your life is at risk. If, on the other hand, you want to let go, then the idea of making a change cannot hurt your well-being. A decision at this stage is not what you put forward for yourself, but the outcome of that decision in your life. It would be the end of the past, something you can learn to live with. You can see the future instead of waiting for the future to come. And if you've made that choice, then perhaps the whole story is true, or it isn't. If you're not ready and willing to change it at least until the next step: the next life, then a future that you care about. You are your future, and you have decided it, regardless of what you are feeling. The same happens to your body, as well as to yourself. You need to decide what makes you or you don, if someone makes you hurt, or want you to hurt. It's your decision as a person.\n",
            "- I like to think \"just do what I want.\" Why are you in control? Why do you feel so strongly about it? It's up to you. What I like about living the life I want is the ability to be a good person with a positive attitude in a positive light\n",
            "\n",
            "\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "                                            document  \\\n",
              "0  The Architect: Hello, Neo. Neo: Who are you? T...   \n",
              "\n",
              "                                           generated  unique_word_score  \n",
              "0   The Architect: Hello, Neo. Neo: Who are you? ...          0.4321133  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-1eb661d0-acb6-4f3d-aa33-c3155b705560\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>document</th>\n",
              "      <th>generated</th>\n",
              "      <th>unique_word_score</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>The Architect: Hello, Neo. Neo: Who are you? T...</td>\n",
              "      <td>The Architect: Hello, Neo. Neo: Who are you? ...</td>\n",
              "      <td>0.4321133</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-1eb661d0-acb6-4f3d-aa33-c3155b705560')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "        \n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "      \n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-1eb661d0-acb6-4f3d-aa33-c3155b705560 button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-1eb661d0-acb6-4f3d-aa33-c3155b705560');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 29
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## GPT for Programing Code\n",
        "\n",
        "\n",
        "![image](https://static01.nyt.com/images/2022/04/04/multimedia/15ai-nocode/15ai-nocode-mobileMasterAt3x.jpg)"
      ],
      "metadata": {
        "id": "QPpwJVU3-xJA",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "###  Scala"
      ],
      "metadata": {
        "id": "pggmo3lJ-y7Q",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Hello world in Scala\n",
        "scala_code = \"\"\"\n",
        "object Hello {\n",
        "    def main(args: Array[String]) = {\n",
        "        println(\"Hello, world\")\n",
        "    }\n",
        "}\n",
        "\"\"\"\n",
        "gpt2_pipe['gpt2'].setMaxOutputLength(1000)\n",
        "\n",
        "gpt2_pipe['gpt2'].setTemperature(1)\n",
        "gpt2_pipe['gpt2'].setDoSample(True)\n",
        "generate_with_gpt2(gpt2_pipe, scala_code)\n",
        "\n"
      ],
      "metadata": {
        "id": "mEe0BQ5P-0ze",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 482
        },
        "outputId": "127fc42a-92aa-4ffb-e416-61d2507a951a",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Example 0: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " object Hello { def main(args: Array[String]) = { println(\"Hello, world\") } }\n",
            "\n",
            "To get the global state of the model in a function called Hello that evaluates it, you can use the functions array() and getInstance() to get the state of its parent, and also to get its state that has been updated.\n",
            "\n",
            "def getInstance(args): self.model = model def update(args, updates): return _model.load() if __name__ == '__main__': ... def main(): print(\"Hello world\") main()\n",
            "\n",
            "import random def myModel(args...): print(\"my model\") myModel = Random.class print(\"models: {:r} {:h} {:s}\") mymodel.setState(HelloWorld(\"world\"))\n",
            "\n",
            "As you can verify, I only used getInstance to get global state from the constructor of a main method that fires any subsequent calls to that method. So if you call a main() that fires myModel with updates as its argument (and your parent actually needs to call it to initialize the model), you get exactly the same data as if you had run the main method with updates loaded. But this isn't the only time I've observed this:\n",
            "\n",
            "print(\"Hello New world\") print (\"Hello Hello world\")\n",
            "\n",
            "The first time I invoked the main that had updates set to all members in my model (and fired getInstance , I don't know if this was due to the main or the update argument, which I can't verify), I saw the code that had just started. After a while, it got pretty obvious that we would need to call the main to get all members of our model. A little bit after this, when I tried my last attempt, my code got to show the main code running in parallel:\n",
            ":main = main() print(\"World.world()\") myWorld = new HelloWorld() myWorld.setId(Hello World.type) print(\"world.main()\")\n",
            " I've never seen this before for any of my model.\n",
            " (You won't notice that I've tried a different approach for this, but to give the reader that idea, please see my new post. You will also see how I am using the function getClass instead :main )\n",
            "\n",
            "You will notice how I call the loop's main method as well as the loop initialization and loop dispatch for each of the updates . Both events, which we'll see later, are handled directly by the models, so it doesn't matter what model you invoke, it's always just you.\n",
            "\n",
            "\n",
            "\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "                                            document  \\\n",
              "0  object Hello { def main(args: Array[String]) =...   \n",
              "\n",
              "                                           generated  unique_word_score  \n",
              "0   object Hello { def main(args: Array[String]) ...          0.5796703  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-b3b2a0ee-dfea-4a8c-94fa-40765aee399f\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>document</th>\n",
              "      <th>generated</th>\n",
              "      <th>unique_word_score</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>object Hello { def main(args: Array[String]) =...</td>\n",
              "      <td>object Hello { def main(args: Array[String]) ...</td>\n",
              "      <td>0.5796703</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-b3b2a0ee-dfea-4a8c-94fa-40765aee399f')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "        \n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "      \n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-b3b2a0ee-dfea-4a8c-94fa-40765aee399f button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-b3b2a0ee-dfea-4a8c-94fa-40765aee399f');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 30
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Python"
      ],
      "metadata": {
        "id": "2Cf2a0bP-_Kb",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Etq6PvoksJpL",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 240
        },
        "outputId": "67582a1e-d18a-48cc-b67c-a3ee2b5a98da",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Example 0: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " def my_func(a,b,c): return a+b+c\n",
            "\n",
            "The \"use_func\" function (also available locally, e.g., if the file argument should be \"my_functo\" , in the case of a directory), creates a new function that has been explicitly created: create_executable() by default. Using a separate function allows us to use the same syntax for any given function.\n",
            "\n",
            "Use the code in the above examples more carefully: to have the same effect for all functions you use, you have to use it with the same name. The code in here is taken from the code for \"Create a function: function\" in the directory that you put it in.\n",
            "\n",
            "\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "                           document  \\\n",
              "0  def my_func(a,b,c): return a+b+c   \n",
              "\n",
              "                                           generated  unique_word_score  \n",
              "0   def my_func(a,b,c): return a+b+c\\n\\nThe \"use_...          0.7029703  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-bfae99de-c0e1-46e5-a1f9-2bacbc23f0ae\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>document</th>\n",
              "      <th>generated</th>\n",
              "      <th>unique_word_score</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>def my_func(a,b,c): return a+b+c</td>\n",
              "      <td>def my_func(a,b,c): return a+b+c\\n\\nThe \"use_...</td>\n",
              "      <td>0.7029703</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-bfae99de-c0e1-46e5-a1f9-2bacbc23f0ae')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "        \n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "      \n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-bfae99de-c0e1-46e5-a1f9-2bacbc23f0ae button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-bfae99de-c0e1-46e5-a1f9-2bacbc23f0ae');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 31
        }
      ],
      "source": [
        "# Prompt: a small Python function that adds three numbers\n",
        "py_code = \"\"\"\n",
        "def my_func(a,b,c):\n",
        "  return a+b+c\n",
        "\"\"\"\n",
        "# Allow long outputs and sample at temperature 1 instead of decoding greedily\n",
        "gpt2_pipe['gpt2'].setMaxOutputLength(1000)\n",
        "gpt2_pipe['gpt2'].setTemperature(1)\n",
        "gpt2_pipe['gpt2'].setDoSample(True)\n",
        "generate_with_gpt2(gpt2_pipe, py_code)\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "4f6VP6P72e6r",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 621
        },
        "outputId": "ca13edd7-019a-45fe-c976-a613329a324c",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Example 0: ________________________________________________________________________________________________________________________________________________________________________________________________________\n",
            " def bfs(visited, graph, node): visited.append(node) queue.append(node) while queue: s = queue.pop(0) print (s, end = \" \") for neighbour in graph[s]: if neighbour not in visited: visited.append(neighbour) queue.append(neighbour) return queue.split(self.visited_node) else: s.append_to(node), neighbour.append to(neighed_trail, bfs.reverse()) queue.splice(self._visited).substr(1, 1)\n",
            "\n",
            "Skeletal.render() .render(a = {}) .render() + (a = {});\n",
            "\n",
            "When building on a new source of data, we can reuse our previous build and add new views. The following example has no view, but some methods to render items, which then need to update the view later.\n",
            "\n",
            "@Override public void render(List<Item<Void> > v) { System.out.println(\"Items\"); Void.add(v); } @Override public List<Item> getTasks(list<Item<?> nodes) -> List<> getSelectedItems() -> List<?> getNestedItems() { List<Id> items = nodes.get(0), new_item = new_items.first().split('', ''), sorted = items.length, new_markup = items[0].remove(), } @Nullable public Void createTasks() { Void nodes = findTasks().get(1); try { node_remove = new Void(node.name, node.size), nodes.appendNode(), new_name =node_remove.getTag(node_name, newNode); } catch (e) {} System.error.println( \"Unable to create queues for: \" e.name) } return null }\n",
            "\n",
            "The above example builds a list of items and lists one by one to see which queue.push is at which point.\n",
            " \" : \" { \" \" 0 \" 0 0 0 \" . get(\"list\", new_count = nil)) : \" [ { \" time \" : new_list, \" label \" : name})\n",
            "\n",
            "You'll note that the map is written like this:\n",
            "\n",
            ". Map the items here. Each item is a sub-item. The map will be of the order it is found. The item is not pushed onto the map, it comes back.\n",
            " —-> []\n",
            "\n",
            "Using the Map methods directly\n",
            "\n",
            "We use Map(List <Item<Node> from inside list.Map to get out a map object that's always in use) to use the MapBuilder from inside the map class.\n",
            " (Note: we're using MapBuilder so that it may not use Map for our MapList class because there's an uninitialized, uninitialized MapList interface in CodeRunner.cs; this is a bug, but they get fixed very soon...)\n",
            "\n",
            "# class MapBuilder @Map <Item>, class List <Node> { ... Map n = new Map<>(); ... List n { int index; // initializes to n index = '\\0'; for(int i = 0; i < 1; i++){ Ns ns(s, index); return n; } } }\n",
            ".Map(n = new List<int>()); .Map(newNode = newMap.Alloc({ n : n});)\n",
            " \" = { \" map: \" map.set(newList, newList.alloc()) } .map(newPath = newList))\n",
            "\n",
            "This is the single best way to render a Tree class in Code Runner: it automatically creates a new Tree tree for your class. This makes it easy to write \"unfolded-tree\" files into a Code Runner.\n",
            "\n",
            "\n",
            "\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "                                            document  \\\n",
              "0  def bfs(visited, graph, node): visited.append(...   \n",
              "\n",
              "                                           generated  unique_word_score  \n",
              "0   def bfs(visited, graph, node): visited.append...          0.6372549  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-866dbef6-3842-42e1-92b4-6b5bc73974f1\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>document</th>\n",
              "      <th>generated</th>\n",
              "      <th>unique_word_score</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>def bfs(visited, graph, node): visited.append(...</td>\n",
              "      <td>def bfs(visited, graph, node): visited.append...</td>\n",
              "      <td>0.6372549</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-866dbef6-3842-42e1-92b4-6b5bc73974f1')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "        \n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "      \n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-866dbef6-3842-42e1-92b4-6b5bc73974f1 button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-866dbef6-3842-42e1-92b4-6b5bc73974f1');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 32
        }
      ],
      "source": [
        "# Prompt: a breadth-first search snippet (`queue` is never defined in it; it is only used as a prompt)\n",
        "bfs_python_code = \"\"\"\n",
        "def bfs(visited, graph, node):\n",
        "  visited.append(node)\n",
        "  queue.append(node)\n",
        "\n",
        "  while queue:\n",
        "    s = queue.pop(0) \n",
        "    print (s, end = \" \") \n",
        "\n",
        "    for neighbour in graph[s]:\n",
        "      if neighbour not in visited:\n",
        "        visited.append(neighbour)\n",
        "        queue.append(neighbour)\n",
        "\"\"\"\n",
        "\n",
        "\n",
        "# Same generation settings as for the previous prompt\n",
        "gpt2_pipe['gpt2'].setMaxOutputLength(1000)\n",
        "gpt2_pipe['gpt2'].setTemperature(1)\n",
        "gpt2_pipe['gpt2'].setDoSample(True)\n",
        "generate_with_gpt2(gpt2_pipe, bfs_python_code)\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "S5wou4nt0FDf",
        "pycharm": {
          "name": "#%% md\n"
        }
      },
      "source": [
        "# Data Augmentation\n",
        "Just as computer vision practitioners augment their datasets by generating new training examples, NLP offers many techniques to achieve the same goal.\n",
        "\n",
        "- This is especially useful for training models that are smaller and faster than the large language model used to generate the data.\n",
        "- Because the large language model has been trained on huge text corpora covering a wide variety of topics, we can use it to generate text in settings where little data is available or where classes are imbalanced.\n",
        "\n",
        "\n",
        "This segment demonstrates one of many possible approaches and provides a simple, reusable framework for exploring other methods.\n",
        "\n",
        "\n",
        "See [A Survey of Data Augmentation Approaches for NLP](https://arxiv.org/pdf/2105.03075.pdf) for a detailed breakdown and analysis of many other augmentation methods.\n",
        "\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Data Augmentation Example for a News Classifier Dataset\n",
        "\n",
        "We will use 90% of the data for testing and only 10% for training, to simulate a scenario in which little training data is available."
      ],
      "metadata": {
        "id": "TFjOuZIDyRLK",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "B2MaOzxW0G_T",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [],
      "source": [
        "%%capture\n",
        "# Download the news-category train and test CSVs (this cell reads only the test file)\n",
        "! wget https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/news_Category/news_category_train.csv\n",
        "! wget https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/news_Category/news_category_test.csv\n",
        "from sklearn.metrics import classification_report\n",
        "from sklearn.model_selection import train_test_split\n",
        "import pandas as pd\n",
        "\n",
        "# Keep at most the first 10,000 rows and name the columns: label (y) and text\n",
        "test_path = '/content/news_category_test.csv'\n",
        "base_dataset = pd.read_csv(test_path).iloc[:10000]\n",
        "base_dataset.columns=['y','text']"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "base_dataset.y.value_counts().plot.bar()"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 318
        },
        "id": "-VwzZ7McFc3x",
        "outputId": "1a0c24f3-e2bc-4e56-b6d6-003d0ccb2402",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "<matplotlib.axes._subplots.AxesSubplot at 0x7fb7e3e0c410>"
            ]
          },
          "metadata": {},
          "execution_count": 17
        },
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "<Figure size 432x288 with 1 Axes>"
            ],
            "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX0AAAEbCAYAAAA21FQWAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAXZElEQVR4nO3df7BkdXnn8ffHGSVGUTDcZYFhHGRHE3B1lBHxV1bXRIFkRV1LmdoIUeNgKWt03bLAVC3GhJQVNa5uDO4YEVEDQogRd0FAorhxRR1wwg8DOiAsMxlhBAUXXSLDs3+cc6Ud78zc291zm57v+1XVdbufc7r7uV0zn3v6e77nnFQVkqQ2PGzSDUiSFo+hL0kNMfQlqSGGviQ1xNCXpIYY+pLUkKWTbmBX9ttvv1qxYsWk25CkqXHVVVd9v6pm5lr2kA/9FStWsH79+km3IUlTI8mtO1rm8I4kNcTQl6SGGPqS1BBDX5IaYuhLUkMMfUlqiKEvSQ0x9CWpIQ/5g7N2hxWn/M9Jt7BLt7z7tybdwrxMw2cJfp7j5uc5Pov9WbqlL0kNMfQlqSGGviQ1xNCXpIYY+pLUEENfkhpi6EtSQwx9SWqIoS9JDTH0Jakhhr4kNcTQl6SG7DL0k5yZ5I4k1w3UPp1kQ3+7JcmGvr4iyU8Gln144DlHJLk2ycYkH0yS3fMrSZJ2ZD5n2TwL+HPg7NlCVb1q9n6S9wF3D6x/U1WtmuN1zgBeD3wNuAg4Grh44S1Lkoa1yy39qvoycNdcy/qt9VcC5+zsNZIcADymqq6sqqL7A/LShbcrSRrFqGP6zwNur6rvDNQOSfLNJFckeV5fOwjYNLDOpr4mSVpEo15EZQ0/v5W/BVheVXcmOQL42ySHL/RFk6wF1gIsX758xBYlSbOG3tJPshR4OfDp2VpV3VdVd/b3rwJuAp4IbAaWDTx9WV+bU1Wtq6rVVbV6ZmZm2BYlSdsZZXjnN4AbqupnwzZJZpIs6e8/AVgJ3FxVW4B7khzV7wc4AfjsCO8tSRrCfKZsngN8FXhSkk1JXtcvOp5f3IH768A1/RTOvwbeUFWzO4HfCPwlsJHuG4AzdyRpke1yTL+q1uyg/rtz1C4ALtjB+uuBJy+wP0nSGHlEriQ1xNCXpIYY+pLUEENfkhpi6EtSQwx9SWqIoS9JDTH0Jakhhr4kNcTQl6SGGPqS1BBDX5IaYuhLUkMMfUlqiKEvSQ0x9CWpIYa+JDXE0JekhsznGrlnJrkjyXUDtXcm2ZxkQ387dmDZqUk2JrkxyYsH6kf3tY1JThn/ryJJ2pX5bOmfBRw9R/39VbWqv10EkOQwugumH94/5y+SLEmyBPgQcAxwGLCmX1eStIjmc2H0LydZMc/XOw44t6ruA76bZCNwZL9sY1XdDJDk3H7dby24Y0nS0EYZ0z85yTX98M++fe0g4LaBdTb1tR3VJUmLaNjQPwM4FFgFbAHeN7aOgCRrk6xPsn7r1q3jfGlJatpQoV9Vt1fVtqp6APgIDw7hbAYOHlh1WV/bUX1Hr7+uqlZX1eqZmZlhWpQkzWGo0E9ywMDDlwGzM3suBI5PsleSQ4CVwNeBbwArkxyS5BF0O3svHL5tSdIwdrkjN8k5wPOB/ZJsAk4Dnp9kFVDALcBJAFV1fZLz6HbQ3g+8qaq29a9zMnAJsAQ4s6quH/tvI0naqfnM3lkzR/mjO1n/dOD0OeoXARctqDtJ0lh5RK4kNcTQl6SGGPqS1BBDX5IaYuhLUkMMfUlqiKEvSQ0x9CWpIYa+JDXE0Jekhhj6ktQQQ1+SGmLoS1JDDH1JaoihL0kNMfQlqSGGviQ1xNCXpIbsMvSTnJnkjiTXDdTek+SGJNck+UySffr6iiQ/SbKhv3144DlHJLk2ycYkH0yS3fMrSZJ2ZD5b+mcBR29Xuwx4clU9Bfg2cOrAspuqalV/e8NA/Qzg9c
DK/rb9a0qSdrNdhn5VfRm4a7vapVV1f//wSmDZzl4jyQHAY6rqyqoq4GzgpcO1LEka1jjG9F8LXDzw+JAk30xyRZLn9bWDgE0D62zqa5KkRbR0lCcn+QPgfuBTfWkLsLyq7kxyBPC3SQ4f4nXXAmsBli9fPkqLkqQBQ2/pJ/ld4LeB/9AP2VBV91XVnf39q4CbgCcCm/n5IaBlfW1OVbWuqlZX1eqZmZlhW5QkbWeo0E9yNPB24CVV9eOB+kySJf39J9DtsL25qrYA9yQ5qp+1cwLw2ZG7lyQtyC6Hd5KcAzwf2C/JJuA0utk6ewGX9TMvr+xn6vw68K4kPwUeAN5QVbM7gd9INxPokXT7AAb3A0iSFsEuQ7+q1sxR/ugO1r0AuGAHy9YDT15Qd5KksfKIXElqiKEvSQ0x9CWpIYa+JDXE0Jekhhj6ktQQQ1+SGmLoS1JDDH1JaoihL0kNMfQlqSGGviQ1xNCXpIYY+pLUEENfkhpi6EtSQwx9SWqIoS9JDTH0Jakh8wr9JGcmuSPJdQO1xyW5LMl3+p/79vUk+WCSjUmuSfL0geec2K//nSQnjv/XkSTtzHy39M8Cjt6udgpweVWtBC7vHwMcA6zsb2uBM6D7IwGcBjwTOBI4bfYPhSRpccwr9Kvqy8Bd25WPAz7e3/848NKB+tnVuRLYJ8kBwIuBy6rqrqr6AXAZv/iHRJK0G40ypr9/VW3p738P2L+/fxBw28B6m/rajuq/IMnaJOuTrN+6desILUqSBo1lR25VFVDjeK3+9dZV1eqqWj0zMzOul5Wk5o0S+rf3wzb0P+/o65uBgwfWW9bXdlSXJC2SUUL/QmB2Bs6JwGcH6if0s3iOAu7uh4EuAV6UZN9+B+6L+pokaZEsnc9KSc4Bng/sl2QT3SycdwPnJXkdcCvwyn71i4BjgY3Aj4HXAFTVXUn+CPhGv967qmr7ncOSpN1oXqFfVWt2sOiFc6xbwJt28DpnAmfOuztJ0lh5RK4kNcTQl6SGGPqS1BBDX5IaYuhLUkMMfUlqiKEvSQ0x9CWpIYa+JDXE0Jekhhj6ktQQQ1+SGmLoS1JDDH1JaoihL0kNMfQlqSGGviQ1xNCXpIYMHfpJnpRkw8DtniRvSfLOJJsH6scOPOfUJBuT3JjkxeP5FSRJ8zWva+TOpapuBFYBJFkCbAY+Q3ch9PdX1XsH109yGHA8cDhwIPCFJE+sqm3D9iBJWphxDe+8ELipqm7dyTrHAedW1X1V9V1gI3DkmN5fkjQP4wr944FzBh6fnOSaJGcm2bevHQTcNrDOpr4mSVokI4d+kkcALwHO70tnAIfSDf1sAd43xGuuTbI+yfqtW7eO2qIkqTeOLf1jgKur6naAqrq9qrZV1QPAR3hwCGczcPDA85b1tV9QVeuqanVVrZ6ZmRlDi5IkGE/or2FgaCfJAQPLXgZc19+/EDg+yV5JDgFWAl8fw/tLkuZp6Nk7AEkeBfwmcNJA+U+TrAIKuGV2WVVdn+Q84FvA/cCbnLkjSYtrpNCvqnuBX9mu9uqdrH86cPoo7ylJGp5H5EpSQwx9SWqIoS9JDTH0Jakhhr4kNcTQl6SGGPqS1BBDX5IaYuhLUkMMfUlqiKEvSQ0x9CWpIYa+JDXE0Jekhhj6ktQQQ1+SGmLoS1JDDH1JasjIoZ/kliTXJtmQZH1fe1ySy5J8p/+5b19Pkg8m2ZjkmiRPH/X9JUnzN64t/RdU1aqqWt0/PgW4vKpWApf3jwGOAVb2t7XAGWN6f0nSPOyu4Z3jgI/39z8OvHSgfnZ1rgT2SXLAbupBkrSdcYR+AZcmuSrJ2r62f1Vt6e9/D9i/v38QcNvAczf1NUnSIlg6htd4blVtTvIvgMuS3DC4sKoqSS3kBfs/HmsBli9fPoYWJUkwhi39qtrc/7wD+AxwJHD77LBN//OOfvXNwMEDT1/W17Z/zXVVtbqqVs/MzIzaoiSpN1LoJ3lUkr1n7wMvAq4DLgRO7Fc7Ef
hsf/9C4IR+Fs9RwN0Dw0CSpN1s1OGd/YHPJJl9rb+qqs8n+QZwXpLXAbcCr+zXvwg4FtgI/Bh4zYjvL0lagJFCv6puBp46R/1O4IVz1At40yjvKUkankfkSlJDDH1JaoihL0kNMfQlqSGGviQ1xNCXpIYY+pLUEENfkhpi6EtSQwx9SWqIoS9JDTH0Jakhhr4kNcTQl6SGGPqS1BBDX5IaYuhLUkMMfUlqyNChn+TgJF9M8q0k1yf5/b7+ziSbk2zob8cOPOfUJBuT3JjkxeP4BSRJ8zfKNXLvB95WVVcn2Ru4Ksll/bL3V9V7B1dOchhwPHA4cCDwhSRPrKptI/QgSVqAobf0q2pLVV3d3/8R8I/AQTt5ynHAuVV1X1V9F9gIHDns+0uSFm4sY/pJVgBPA77Wl05Ock2SM5Ps29cOAm4beNomdv5HQpI0ZiOHfpJHAxcAb6mqe4AzgEOBVcAW4H1DvObaJOuTrN+6deuoLUqSeiOFfpKH0wX+p6rqbwCq6vaq2lZVDwAf4cEhnM3AwQNPX9bXfkFVrauq1VW1emZmZpQWJUkDRpm9E+CjwD9W1Z8N1A8YWO1lwHX9/QuB45PsleQQYCXw9WHfX5K0cKPM3nkO8Grg2iQb+to7gDVJVgEF3AKcBFBV1yc5D/gW3cyfNzlzR5IW19ChX1V/D2SORRft5DmnA6cP+56SpNF4RK4kNcTQl6SGGPqS1BBDX5IaYuhLUkMMfUlqiKEvSQ0x9CWpIYa+JDXE0Jekhhj6ktQQQ1+SGmLoS1JDDH1JaoihL0kNMfQlqSGGviQ1xNCXpIYY+pLUkEUP/SRHJ7kxycYkpyz2+0tSyxY19JMsAT4EHAMcBqxJcthi9iBJLVvsLf0jgY1VdXNV/TNwLnDcIvcgSc1KVS3emyWvAI6uqt/rH78aeGZVnbzdemuBtf3DJwE3LlqTw9kP+P6km9iD+HmOl5/neE3D5/n4qpqZa8HSxe5kPqpqHbBu0n3MV5L1VbV60n3sKfw8x8vPc7ym/fNc7OGdzcDBA4+X9TVJ0iJY7ND/BrAyySFJHgEcD1y4yD1IUrMWdXinqu5PcjJwCbAEOLOqrl/MHnaTqRmKmhJ+nuPl5zleU/15LuqOXEnSZHlEriQ1xNCXpIYY+pLUEENf2oMl2TfJUybdhx463JE7pCS/D3wM+BHwl8DTgFOq6tKJNjbFkjwbWMHArLKqOntiDU2pJF8CXkL3OV4F3AF8par+0yT7mjZJPgfsMCCr6iWL2M7YPCSPyJ0Sr62qDyR5MbAv8GrgE4ChP4QknwAOBTYA2/pyAYb+wj22qu5J8nvA2VV1WpJrJt3UFHpv//PlwL8EPtk/XgPcPpGOxsDQH176n8cCn6iq65NkZ0/QTq0GDiu/eo7D0iQHAK8E/mDSzUyrqroCIMn7tjvtwueSrJ9QWyNzTH94VyW5lC70L0myN/DAhHuaZtfRbU1pdH9IdwDkxqr6RpInAN+ZcE/T7FH9ZwhAkkOAR02wn5E4pj+kJA8DVgE3V9UPkzwOWFZVfo1egIFx073pPs+vA/fNLp/WcdNJSvKcqvrKrmqanyRH0x2FezPdN/zHAydV1SUTbWxIhv6QkjwH2FBV9yb5HeDpwAeq6tYJtzZVkvybnS2f/Yqt+UtydVU9fVc1zV+SvYBf7R/eUFX37Wz9hzLH9Id3BvDUJE8F3kY3g+dsYKchpp83MG56CLClqv5f//iRwP6T7G3aJHkW8GxgJsngTJ3H0J3rSguQ5OU7WHRoEqrqbxa1oTEx9Id3f1VVkuOAP6+qjyZ53aSbmmLn0wXWrG197RmTaWcqPQJ4NN3/670H6vcAr5hIR9Pt3+1kWQGGfmN+lORUuqmaz+vH+B8+4Z6m2dL+EpoAVNU/96ff1jxV1RVJ/h54SlX94aT7mXZV9Zr+//Urquq8SfczLs7eGd6r6HY4vraqvkd3QZj3TLalqbY1yc922vbfoB7ql6R7yKmqbcCBk+
5jT1FVDwBvn3Qf4+SO3BEkeTywsqq+kOSXgSVV9aNJ9zWNkhwKfAo4iO6r8ybghKraONHGplCSM+g+x/OBe2fr0zoGPWlJ3k23AfJpfv7zvGtiTY3A0B9SktfTXbz9cVV1aJKVwIer6oUTbm2qJXk0QFX930n3Mq2SfGyOclXVaxe9mT1Aku/OUa6qesIc9Yc8Q39ISTYARwJfq6qn9bVrq+pfT7az6ZRkf+BPgAOr6pgkhwHPqqqPTrg1aY/imP7w7hvc8ZhkKTs5OZN26Sy6o0hnx6O/DbxlYt1MsSTLknwmyR397YIkyybd17RK8vAkb07y1/3t5CRTO2nD0B/eFUneATwyyW/SjZ9+bsI9TbP9+hkSD0B3PWUePPGaFuZjwIV0f0APpPt3OdeQj+bnDOAI4C/62xF9bSo5ZXN4pwCvA64FTgIuojtASwuQZGkf8Pcm+RX6b0tJjgLunmhz02umqgZD/qwkfmsa3jOq6qkDj/8uyT9MrJsRGfpD6qdyfaS/aXhfpzuFxdvotk4PTfIVYAYPKBrWnf2pQc7pH68B7pxgP9NuW5JDq+omgP7ka1P7LdQduUPqz73zTrqTLy2lOxHT1O7Rn5Qk3xzYEb4UeBLdZ3ljVf10os1NqX4q8X8DntWXvgK8uar+z+S6mj79t6P/DexDt3E3O4tnBd3xOX83odZGYugPKckNwFvprkz0s7/6VeUW1QIk2QT82Y6WV9UOl0m7U5L30p0a5NfoTk29CfgicEFV/dMkexuFwzvDu7uqLp50E3uAJXTni/ECNGPSDz98ADiKbh/JV4G3VtXNE21sylTVfwboTweymu4PwPOBU5P8sKoOm2B7QzP0h/fFJO+hO+nS4Pnfr55cS1Ppe1X1rkk3sYf5K+BDwMv6x8fTje8/c2IdTbdH0p2p9LH97Z/oJnBMJYd3hpTki3OUq6r+7aI3M8U8z/v4Jbmmqp6yXe0ftpuBol1Isg44HPgR8DXgSuDKqvrBRBsbkVv6Q6qqF0y6hz3EPkk+AFwMfGn2fPoaycVJTgHOpRveeRVwUX91t6k9Z8wELAf2ohvP30w3pv/DiXY0Bm7pL1CS36mqT253kYqfccfjwvQzdp4LHA28gG5q4SXAxVX17Un2Nq0GzhUz+597cH+JM8wWIEnotvaf3d+eDNwFfLWqTptkb8NyS3/hZi+IvPdO19K89Admfam/keRAuj8Af5zkX9F9nX7jxBqcIkmeAdxWVYf0j08E/j1wC/BOt/AXrrqt4uuS/JDuYMG7gd+mO+/WVIa+W/p6yOovYPEsL+g9P0muBn6jqu5K8ut0wzv/ke6C879WVR7stgBJ3syDW/g/pZuzP3u7tj9Ac+q4pT+kJH8K/DHwE+DzwFPopsV9cqKNTZkk/7Wq3pLkc8xxwrqqeskcT9Pclgxszb8KWFdVFwAX9GeF1cKsoDun1lurasuEexkbQ394L6qqtyd5Gd3X55cDXwYM/YX5RP/zvRPtYs+wZOBcRi+ku97DLP+vL1BVzbnfbtr5D2F4s5/dbwHnV9Xd3T4fLURVXdXfXQ/8ZPYrc5IldDMnNH/n0J399ft030D/F0C/b8ST1wnw1Mqj+B/9qRiOAC5PMgM43XB4lwO/PPD4kcAXJtTLVKqq0+lOXHcW8Nx6cIfdw+jG9iV35I6in/d8d1Vt66+R+5j+IulaoCQbqmrVrmqSRuPwzpCSnDBwf3DR2YvfzR7h3iRPnz2NRZLVdEMUksbI0B/eMwbu/xLdjrOrMfSH9Rbg/CSzZy88gG4GiqQxcnhnTJLsA5xbVUdPupdpMnBA0ff6646eRDcT6lvAf/GAImm83JE7PvcCh0y6iSn034HZC8w/C3gH3RkifwCsm1RT0p7K4Z0hbXcw0cOAw4DzJtfR1PKAImkRGfrDGzyY6H7g1qraNKlmppgHFEmLyP9UQ6qqK2bvJ9kPLzw9LA8okhaRO3IXKMlRwLvpTq/6R3SnEdiPbojnhKr6/A
Tbm0r9Z3oAcGlV3dvXngg82iuRSeNl6C9QkvV0OxsfS7ej8ZiqujLJrwLnVNXTJtqgJO2Es3cWbmlVXVpV59Nd3/VKgKq6YcJ9SdIuGfoLN3gO7e2PGPVrk6SHNId3FijJNro5+aE7KdiPZxcBv1RVD59Ub5K0K4a+JDXE4R1JaoihL0kNMfQlqSGGviQ1xNCXpIb8f5gU+BQ49XRPAAAAAElFTkSuQmCC\n"
          },
          "metadata": {
            "needs_background": "light"
          }
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "base_dataset.head(5)"
      ],
      "metadata": {
        "id": "gaGnBSYEFwrH",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 206
        },
        "outputId": "157cc858-1945-442b-e324-1c83fc58dcf2",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "          y                                               text\n",
              "0  Business  Unions representing workers at Turner   Newall...\n",
              "1  Sci/Tech   TORONTO, Canada    A second team of rocketeer...\n",
              "2  Sci/Tech   A company founded by a chemistry researcher a...\n",
              "3  Sci/Tech   It's barely dawn when Mike Fitzpatrick starts...\n",
              "4  Sci/Tech   Southern California's smog fighting agency we..."
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-e3ebb9c5-07d8-40a8-95c8-37d3b549718d\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>y</th>\n",
              "      <th>text</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Business</td>\n",
              "      <td>Unions representing workers at Turner   Newall...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>Sci/Tech</td>\n",
              "      <td>TORONTO, Canada    A second team of rocketeer...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>Sci/Tech</td>\n",
              "      <td>A company founded by a chemistry researcher a...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>Sci/Tech</td>\n",
              "      <td>It's barely dawn when Mike Fitzpatrick starts...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>Sci/Tech</td>\n",
              "      <td>Southern California's smog fighting agency we...</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-e3ebb9c5-07d8-40a8-95c8-37d3b549718d')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "        \n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "      \n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-e3ebb9c5-07d8-40a8-95c8-37d3b549718d button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-e3ebb9c5-07d8-40a8-95c8-37d3b549718d');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 3
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "import nlu\n",
        "gpt2_pipe = nlu.load('gpt2')\n",
        "gpt2_pipe"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "hTPbUvw9YVwV",
        "outputId": "549e3088-3777-4b49-de70-2a95f3a4cef2",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "gpt2 download started this may take some time.\n",
            "Approximate size to download 442.7 MB\n",
            "[OK!]\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "{'gpt2': GPT2TRANSFORMER_b38120f8fb6b,\n",
              " 'document_assembler': DocumentAssembler_63df80543ee3}"
            ]
          },
          "metadata": {},
          "execution_count": 4
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Vanilla Training"
      ],
      "metadata": {
        "id": "sf4pMd1yL2Xs",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Base Dataset Preperation"
      ],
      "metadata": {
        "id": "Wfm4uDtCIuQO",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Let's make sure the label is equeally distributed in train/test\n",
        "# We take 500 in total for each class in our base dataset\n",
        "n_per_class_total = 1000\n",
        "base_dataset = base_dataset.groupby('y').head(n_per_class_total)\n",
        "# We use a 0.85 test/train split to simulate a low data ressource scenario\n",
        "# We draw equally from each labell so data stays balanced\n",
        "train_test_frac = 0.90\n",
        "train_dfs = []\n",
        "test_dfs = []\n",
        "for label in base_dataset.y.unique():\n",
        "  train_df, test_df = train_test_split(base_dataset[base_dataset.y == label],test_size = train_test_frac)\n",
        "  test_dfs.append(test_df)\n",
        "  train_dfs.append(train_df)\n",
        "\n",
        "\n",
        "test_df = pd.concat(test_dfs)\n",
        "train_df = pd.concat(train_dfs)"
      ],
      "metadata": {
        "id": "3xTgJKBNGEs8",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "OBBf2Yx96C5S",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 334
        },
        "outputId": "d0a60b11-8982-4361-a41f-29362031e6df",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "<matplotlib.axes._subplots.AxesSubplot at 0x7fb7e3cfd050>"
            ]
          },
          "metadata": {},
          "execution_count": 19
        },
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "<Figure size 432x288 with 1 Axes>"
            ],
            "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAErCAYAAAAljMNyAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAdqElEQVR4nO3debwcZZ3v8c+XHHYCIeSIJAGCbAqOCIZdHEa8IygI4+UCihoVRUcRRecKLiPogBeVUXBUNIjsu4iAgKIIuIygAVFkk4AsCYQcIAkhLBL43T+ep02n7T59+nTnFP3M9/169et01/o71dXfrnqqukoRgZmZlWWlqgswM7Pec7ibmRXI4W5mViCHu5lZgRzuZmYFcribmRXI4d6GpNMlHdvjae4uac5Yj1s3jaskzehmGnXT2k3SXXWv75P0hl5MO0/vNkm792p6ddPt2TLo5xr+J5J0jKSzq65jReubcM+h8bSkJ+se36i4pndL+lWVNTSSFJKW5OXzmKRrJB1YP0xE7BURZ4xwWpsNN0xE/DIituy27jy/v/sijYitI+K6Xky/YbojWgaNGta/FxrWyYPHooZcR+3zsFjSQkn/LemDkkb0mZY0Lb+/A6OZfwd1djwfSddJet+KGn5Fk/S9xs+OpImSLsmfzfslvb1hnLfn7ksk/VDSxG7r6Jtwz/aJiLXqHodVXdCL1DYRsRawJXA68A1JR/d6Jis6GF6M6tc/4AGWXyfPqQ03Rstmn4gYD2wMHA8cCZw6BvO1FiS9Fti0Sa9vAn8F1gcOBk6WtHUeZ2vgO8A7c/+ngG91XUxE9MUDuA94Q4t+44ATgEeBe4EPAwEMNBsXOAY4u+71RcA8YBHwC2Drun6nA8e2mO+7gV+16Pce4A5gca7pA3X9dgfmAJ/ONd8HHFzXf9X8/zwAPAJ8G1i9ftxhllMAmzV02x94Blgvv74OeF9+vhlwff7fHwUuyN1/kae1BHgSOLCu7iPz8jqrsZ78v3wKuB1YAJwGrNZqedXqBQ4FniN9AJ4ELm987/JyORF4KD9OBFZtWKafAOYDDwPvGWY51S+DdwO/yst8AfAXYK9O1skWy2Zd4EfAUJ7uj4CpvaiBJp8HYAfgBeCV+fWbgd8DTwAPAsfUDftAXvZP5sfOpFD6OfBYXhfOASbUjXMkMJe0Tt8F7JG7rwQcBdyTx70QmNhqPm2W6XHA86T19UngG7n7LsDvSOvp74Bd2gx/Uv6fnwBuAnZr9fnvYUYN5OX9Kuo+h8CapPV6i7phzwKOz8+/CJxb12/TPPz4burpty33Vt4P7A1sC0wnhVknrgI2B14C3Exaqbs1P9e0NinovyZpu7r+LwUmAVOAGcBMSbXmjeOBLYBXk4JvCvC5Lmq5lLTi7dCk338AV5OCaCrwXwAR8brcf5tIW6UX1NU9kbS1eGiL+R0MvJG0km4BfLZdgRExk7Tcv5znt0+TwT4D7ERaLtvk/6d+2i8F1iEtr0OAb0pat928sx1JgTUJ+DJwqiSNcNz6+dcvm5VIX24bAxsBTwPDNSV2VUNE/Jb0BbNb7rQEeBcwgRT0/yppv9yv9v5OyMv7N4CA/wdMBl4BbEgKQvK6eRiwfaS9hTeSvmAAPgLsB/xjHncBaUu11XyG+x8+A/wSOCwPf1huorgC+DqwHvBV4ApJ6zUbPk/qd6T1ZCJwLnCRpNXaLEIkbZSbuVo93j7M6EcAv4iIPzZ03wJYGhF/ruv2B2Dr/Hzr/Lq2DO4hfxm0q3c4/RbuP2xY0O/P3Q8AToyIByPicdIKOmIR8b2IWBwRz5JW5m0krdNNoRFxRUTcE8n1pADdrWGwf4+IZ3P/K4AD8of5UOCIiHg8IhaTvtkP6qKW50hbYs3a8Z4jhc/kiHgmItodQ3gBODrX/XSLYb5R914cB7xttLU3OBj4QkTMj4gh4POkXdma53L/5yLiSt
KW3EiPB9wfEadExPPAGcAGpF3kTiy3bCLisYi4OCKeyu/jcaQAXJE1PER+nyPiuoi4NSJeyIFz3nDzj4jZEfHTXP8QKURrwz9P2nPaStLKEXFfDiGADwKfiYg5dZ+h/XvYNPVm4O6IOCsilkbEecCdQLMNgNr/cnZe/ksj4j9z7W3XhYh4ICImDPM4t9l4kjYEPkDzjbC1SHsQ9RYB4+v6Lxqm/6j0W7jv17CgT8ndJ5N2wWruH+kEJY2TdLykeyQ9wbKtkUndFCppL0k3SHpc0kLgTQ3TXBARSxpqngwMAmsAN9W+xIAf5+6jrWXlPP7jTXp/krTF9tt8Zsp720xuKCKeaTNM43sxecTFDm8yy7+3jdN+LCKW1r1+ivTBGYl5tScR8VR+OtJxa5ZbNpLWkPSdfKDsCVJT1wRJ41ZgDVPI77OkHSVdK2lI0iJSCLdcryWtL+l8SXNzvWfXho+I2cDHSME9Pw9XW/YbA5fUra93kL4MOv1iaqXxfSe/njLM//Jvku6QtCjXtA5dfqbbOJG0YdEY0pA2MtZu6LY2qXlrJP1Hpd/CvZWHSbuQNRs19F9CCsyal9Y9fzuwL/AG0gowLXfvdJf8byStClxMaj9dPyImAFc2THNdSWs21PwQaQv7aVK7f+1LbJ1IB/BGa19gKfDbxh4RMS8i3h8Rk0lbHt9qc4bMSC4j2vhePJSfL/c+SKp/H0Yy7YdIQdJs2i8GjfV/grS1uGNErM2yJopRr1vDkbQ9KfBqe1/nApcBG0bEOqRjN7V5N1vWX8zd/yHX+476WiPi3Ih4Lek9COBLudeDpOMD9Rteq0XE3BbzaadxnMb3HdJ7P7fZ8JJ2I220HACsmz9/ixjBcs/NMk8O82h1RtQewFckzZNU+5L+TW7G+TMwIGnzuuG3AW7Lz2/Lr2s1vIy0p1HfjNOxUsL9QuBwSVNzG+tRDf1vAQ6StLKkxjb58cCzpANBa5BW8E5I0mr1D2AV0pszBCyVtBfwz03G/bykVfLKuDdwUUS8AJxCaqN/SZ7BFElv7LCu2ulXB5PaP78UEY81Geb/SJqaXy4gfVBeyK8fAV7W6XyBD+f3YiKpnbzWXv8HYGtJr87L6ZiG8drN7zzgs5IGJU0i7QK/mM9XHk/6ol6Yl0XPz1gCkLS2pL2B80kHCm+tm//jEfGMpB1IGzI1Q6T3uX55jydtRS6SNAX4v3Xz2FLS6/OGyzP5/6qtJ98GjpO0cR52UNK+reajZadHTmvxLzWuB1cCWyidLjigdGrvVqQD1M2GH0/amBkihern+Pst46Zys8xawzxaHY/bghTQr84PSM1Gl+Q99B8AX5C0pqRdSRtcZ+XhzgH2UfrNyJrAF4Af5Ka8Ueu3cL+84Vv0ktz9FOAnpPC4mbQg6/076eDeAlI7bX272ZmkXby5pDM8buiwpl1IK3rj43DSl84C0ofqsobx5uV+D5He3A9GxJ2535HAbOCGvHv8M0bedgzwB0lP5mm8j9R+3+qA7PbAjXn4y4CPRsS9ud8xwBl5d/uADuZ/LukYw72kMyiOBcgHlL6Q/5+7WbaFWXMqqU13oaQfNpnuscAs4I/AraT3uqc/MOuxE4HVSXtjN5Ca13rpckmLSVvOnyG1kb+nrv+HSIGymPRFeGGtR272OQ74dV7eO5E+G9uRtnKvYPnP0aqkA/2Pktbdl5DOioJ0ZsplwNV5XjeQDg63ms+GLPvMNXMSqc1+gaSv542SvUl7Qo+Rtsr3johHmw1PyoIfk7Z87yd9GT3YOJNeyseB5tUeufOjdcelPkRaF+aTNlL+NSJuy+PeRmoyOyf3H5+H74oiyrtZR94i+AuwckMbrJlVTNJnSccnvlN1LSVzuJuZFajfmmXMzGwEitxyNzP7n85b7mZmBXK4m5kV6EVxVb9JkybFtGnTqi7DzKyv3HTTTY9GRNNfr78own3atGnMmjWr6jLMzPqKpJaXWnGzjJlZgRzuZmYFcribmRXI4W
5mViCHu5lZgdqGu9KdvOdL+lNdt4mSfirp7vx33dxdkr4uabakP2r528qZmdkYGcmW++nAng3djgKuiYjNgWtYdv30vUj3It2cdKu4k3tTppmZdaJtuEfEL/j727PtS7rHI/nvfnXdz4zkBtItxTboVbFmZjYyo/0R0/oR8XB+Po9l90qcwvIXxZ+Tuz1MA0mHkrbu2WijxrvidW/aUVf0fJorwn3Hv7nqEkbEy7N3vCx7y8uzua4PqEa6rGTHl5aMiJkRMT0ipg8Ojvrez2Zm1sRow/2RWnNL/js/d5/L8jdHnkrrW2mZmdkKMtpwvwyYkZ/PAC6t6/6ufNbMTsCiuuYbMzMbI23b3CWdB+wOTJI0h3QH9+OBCyUdQroBbe3myVcCbyLdmPkplr9Zr5mZjZG24R4Rb2vRa48mwwbw4W6LMjOz7vgXqmZmBXK4m5kVyOFuZlYgh7uZWYEc7mZmBXK4m5kVyOFuZlYgh7uZWYEc7mZmBXK4m5kVyOFuZlYgh7uZWYEc7mZmBXK4m5kVyOFuZlYgh7uZWYEc7mZmBXK4m5kVyOFuZlYgh7uZWYEc7mZmBXK4m5kVyOFuZlYgh7uZWYEc7mZmBXK4m5kVyOFuZlYgh7uZWYEc7mZmBXK4m5kVyOFuZlYgh7uZWYG6CndJR0i6TdKfJJ0naTVJm0i6UdJsSRdIWqVXxZqZ2ciMOtwlTQEOB6ZHxCuBccBBwJeAr0XEZsAC4JBeFGpmZiPXbbPMALC6pAFgDeBh4PXA93P/M4D9upyHmZl1aNThHhFzgROAB0ihvgi4CVgYEUvzYHOAKd0WaWZmnemmWWZdYF9gE2AysCawZwfjHypplqRZQ0NDoy3DzMya6KZZ5g3AXyJiKCKeA34A7ApMyM00AFOBuc1GjoiZETE9IqYPDg52UYaZmTXqJtwfAHaStIYkAXsAtwPXAvvnYWYAl3ZXopmZdaqbNvcbSQdObwZuzdOaCRwJfFzSbGA94NQe1GlmZh0YaD9IaxFxNHB0Q+d7gR26ma6ZmXXHv1A1MyuQw93MrEAOdzOzAjnczcwK5HA3MyuQw93MrEAOdzOzAjnczcwK5HA3MyuQw93MrEAOdzOzAjnczcwK5HA3MyuQw93MrEAOdzOzAjnczcwK5HA3MyuQw93MrEAOdzOzAjnczcwK5HA3MyuQw93MrEAOdzOzAjnczcwK5HA3MyuQw93MrEAOdzOzAjnczcwK5HA3MyuQw93MrEAOdzOzAjnczcwK1FW4S5og6fuS7pR0h6SdJU2U9FNJd+e/6/aqWDMzG5lut9xPAn4cES8HtgHuAI4CromIzYFr8mszMxtDow53SesArwNOBYiIv0bEQmBf4Iw82BnAft0WaWZmnelmy30TYAg4TdLvJX1X0prA+hHxcB5mHrB+t0WamVlnugn3AWA74OSI2BZYQkMTTEQEEM1GlnSopFmSZg0NDXVRhpmZNeom3OcAcyLixvz6+6Swf0TSBgD57/xmI0fEzIiYHhHTBwcHuyjDzMwajTrcI2Ie8KCkLXOnPYDbgcuAGbnbDODSrio0M7OODXQ5/keAcyStAtwLvIf0hXGhpEOA+4EDupyHmZl1qKtwj4hbgOlNeu3RzXTNzKw7/oWqmVmBHO5mZgVyuJuZFcjhbmZWIIe7mVmBHO5mZgVyuJuZFcjhbmZWIIe7mVmBHO5mZgVyuJuZFcjhbmZWIIe7mVmBHO5mZgVyuJuZFcjhbmZWIIe7mVmBHO5mZgVyuJuZFcjhbmZWIIe7mVmBHO5mZgVyuJuZFcjhbmZWIIe7mVmBHO5mZgVyuJuZFcjhbmZWIIe7mVmBHO5mZgVyuJuZFcjhbmZWIIe7mVmBug53SeMk/V7Sj/LrTSTdKGm2pAskrdJ9mWZm1olebLl/FLij7vWXgK9FxGbAAuCQHszDzMw60FW4S5oKvBn4bn4t4PXA9/MgZwD7dTMPMzPrXLdb7icCnwReyK/XAxZGxNL8eg
4wpct5mJlZh0Yd7pL2BuZHxE2jHP9QSbMkzRoaGhptGWZm1kQ3W+67Am+RdB9wPqk55iRggqSBPMxUYG6zkSNiZkRMj4jpg4ODXZRhZmaNRh3uEfGpiJgaEdOAg4CfR8TBwLXA/nmwGcClXVdpZmYdWRHnuR8JfFzSbFIb/KkrYB5mZjaMgfaDtBcR1wHX5ef3Ajv0YrpmZjY6/oWqmVmBHO5mZgVyuJuZFcjhbmZWIIe7mVmBHO5mZgVyuJuZFcjhbmZWIIe7mVmBHO5mZgVyuJuZFcjhbmZWIIe7mVmBHO5mZgVyuJuZFcjhbmZWIIe7mVmBHO5mZgVyuJuZFcjhbmZWIIe7mVmBHO5mZgVyuJuZFcjhbmZWIIe7mVmBHO5mZgVyuJuZFcjhbmZWIIe7mVmBHO5mZgVyuJuZFcjhbmZWoFGHu6QNJV0r6XZJt0n6aO4+UdJPJd2d/67bu3LNzGwkutlyXwp8IiK2AnYCPixpK+Ao4JqI2By4Jr82M7MxNOpwj4iHI+Lm/HwxcAcwBdgXOCMPdgawX7dFmplZZ3rS5i5pGrAtcCOwfkQ8nHvNA9bvxTzMzGzkug53SWsBFwMfi4gn6vtFRADRYrxDJc2SNGtoaKjbMszMrE5X4S5pZVKwnxMRP8idH5G0Qe6/ATC/2bgRMTMipkfE9MHBwW7KMDOzBt2cLSPgVOCOiPhqXa/LgBn5+Qzg0tGXZ2ZmozHQxbi7Au8EbpV0S+72aeB44EJJhwD3Awd0V6KZmXVq1OEeEb8C1KL3HqOdrpmZdc+/UDUzK5DD3cysQA53M7MCOdzNzArkcDczK5DD3cysQA53M7MCOdzNzArkcDczK5DD3cysQA53M7MCOdzNzArkcDczK5DD3cysQA53M7MCOdzNzArkcDczK5DD3cysQA53M7MCOdzNzArkcDczK5DD3cysQA53M7MCOdzNzArkcDczK5DD3cysQA53M7MCOdzNzArkcDczK5DD3cysQA53M7MCOdzNzArkcDczK9AKCXdJe0q6S9JsSUetiHmYmVlrPQ93SeOAbwJ7AVsBb5O0Va/nY2Zmra2ILfcdgNkRcW9E/BU4H9h3BczHzMxaUET0doLS/sCeEfG+/PqdwI4RcVjDcIcCh+aXWwJ39bSQFWMS8GjVRRTEy7N3vCx7q1+W58YRMdisx8BYV1ITETOBmVXNfzQkzYqI6VXXUQovz97xsuytEpbnimiWmQtsWPd6au5mZmZjZEWE+++AzSVtImkV4CDgshUwHzMza6HnzTIRsVTSYcBPgHHA9yLitl7PpyJ91YzUB7w8e8fLsrf6fnn2/ICqmZlVz79QNTMrkMPdzKxADnczswI53M0KIGldSa+qug578fAB1TYkfRQ4DVgMfBfYFjgqIq6utLA+JWkXYBp1Z2pFxJmVFdTHJF0HvIW0LG8C5gO/joiPV1lXP5F0OdAyBCPiLWNYTk9V9gvVPvLeiDhJ0huBdYF3AmcBDvcOSToL2BS4BXg+dw7A4T4660TEE5LeB5wZEUdL+mPVRfWZE/LftwIvBc7Or98GPFJJRT3icG9P+e+bgLMi4jZJGm4Ea2k6sFV4d7FXBiRtABwAfKbqYvpRRFwPIOk/Gy43cLmkWRWV1RNuc2/vJklXk8L9J5LGAy9UXFO/+hNp68h64/OkHwvOjojfSXoZcHfFNfWrNfPyA0DSJsCaFdbTNbe5tyFpJeDVwL0RsVDSRGBqRHj3d4Tq2jXHk5blb4Fna/37uV2zSpJ2jYhft+tm7Unak/Sr1HtJe+sbAx+IiJ9UWlgXHO5tSNoVuCUilkh6B7AdcFJE3F9xaX1D0j8O17+2a2ydkXRzRGzXrpuNjKRVgZfnl3dGxLPDDf9i5zb39k4GtpG0DfAJ0hkzZwLDBpYtU9euuQnwcEQ8k1+vDqxfZW39SNLOwC7AoKT6M2PWJl3PyUZI0ltb9NpUEhHxgzEtqIcc7u0tjYiQtC/wjYg4VdIhVR
fVpy4ihVLN87nb9tWU07dWAdYifX7H13V/Ati/kor61z7D9AvA4V6wxZI+RToFcrfcBr9yxTX1q4F860UAIuKv+bLQ1oGIuF7Sr4BXRcTnq66nn0XEe/Jnev+IuLDqenrJZ8u0dyDp4N97I2Ie6eYjX6m2pL41JOlvB0/z3lA/3MrsRScingcmV11HCSLiBeCTVdfRaz6gOgKSNgY2j4ifSVoDGBcRi6uuq99I2hQ4B5hC2uWdA7wrImZXWlifknQyaVleBCypde/nduKqSDqetKFxAcsvy8crK6pLDvc2JL2fdCPviRGxqaTNgW9HxB4Vl9a3JK0FEBFPVl1LP5N0WpPOERHvHfNi+pykvzTpHBHxsibd+4LDvQ1JtwA7ADdGxLa5260R8Q/VVtZ/JK0PfBGYHBF7SdoK2DkiTq24NLPiuM29vWfrDwJKGmCYCw3ZsE4n/aKy1lb8Z+BjlVXT5yRNlXSJpPn5cbGkqVXX1Y8krSzpcEnfz4/DJPX1iRMO9/aul/RpYHVJ/4vUvnl5xTX1q0n5jIQXIN1vl2UXELPOnUa6+fzk/Lg8d7POnQy8BvhWfrwmd+tbPhWyvaOAQ4BbgQ8AV5J+yGQjJGkgB/kSSeuR93wk7QQsqrS4/jYYEfVhfrok7wmNzvYRsU3d659L+kNl1fSAw72NfJrUKflho/Nb0mUbPkHa0txU0q+BQfyjm248li+JcV5+/TbgsQrr6WfPS9o0Iu4ByBcR6+u9Sh9QbSNfW+YY0oWEBkgXFerro+hjTdLv6w5GDwBbkpbjXRHxXKXF9bF8iu5/ATvnTr8GDo+IB6qrqr/kPZ3/BiaQNuBqZ81MI/225ecVldY1h3sbku4EjiDd6eZv3+QR4S2kEZI0B/hqq/4R0bKf2Yok6QTSJTFeQbpc8hzgWuDiiHioytq65WaZ9hZFxFVVF9HnxpGuheKbnPRQbjo4CdiJdBzjN8AREXFvpYX1kYj4N4B8GYzppKDfHfiUpIURsVWF5XXF4d7etZK+QrqAUP01yG+urqS+My8ivlB1EQU6F/gm8C/59UGk9vcdK6uof61OuqrmOvnxEOkkir7lZpk2JF3bpHNExOvHvJg+5WuMrxiS/hgRr2ro9oeGsz5sGJJmAlsDi4EbgRuAGyJiQaWF9YC33NuIiH+quoYCTJB0EnAVcF3teu7WtaskHQWcT2qWORC4Mt8trK+vizKGNgJWJbW3zyW1uS+stKIe8ZZ7C5LeERFnN9wM4W98EHDk8hkyrwX2BP6JdLreT4CrIuLPVdbWz+quh1L7ENcf0/AZXSOUb3i/Nam9fRfglcDjwG8i4ugqa+uGt9xbq90cd/ywQ1lb+QdM1+UHkiaTgv5YSZuRdoM/VFmBfUbS9sCDEbFJfj0D+N/AfcAx3mLvTKQt3D9JWkj6Ud0iYG/SNaX6Nty95W6VyjdK2Nk3dR45STcDb4iIxyW9jtQs8xHSzcdfERH+YdgISTqcZVvsz5HOea89bs0/YuxL3nJvQ9KXgWOBp4EfA68inW52dqWF9RFJJ0bExyRdTpOLrkXEW5qMZq2Nq9s6PxCYGREXAxfnq5jayE0jXS/qiIh4uOJaesrh3t4/R8QnJf0Labf3rcAvAIf7yJ2V/55QaRXlGFd3vZ49SPcbqPFnugMR0fSYWgm8IrRXW0ZvBi6KiEXp+IuNVETclJ/OAp6u7epKGkc6U8E6cx7paqWPkvYofwmQj1/4QmwG+JK/I/GjfAmC1wDXSBoEfCrf6FwDrFH3enXgZxXV0rci4jjSRdhOB14byw6crURqezfzAdWRyOcNL4qI5/M9VNfON8u2Dki6JSJe3a6bmXXPzTJtSHpX3fP6XmeOfTV9b4mk7WqXbpA0ndSsYGY95nBvb/u656uRDmDdjMN9ND4GXCSpdrW9DUhne5hZj7lZpkOSJgDnR8SeVdfSL+p+dDMv35fyA6Szjm4HPucf3Zj1ng+odm4JsEnVRfSZ7wC1m4zvDH
yadDXDBcDMqooyK5mbZdpo+OHNSsBWwIXVVdSX/KMbszHmcG+v/oc3S4H7I2JOVcX0Kf/oxmyM+YPVRkRcX3suaRK+AfFo+Ec3ZmPMB1RbkLQTcDzp0p//QfoJ/SRS08y7IuLHFZbXd/Ly3AC4OiKW5G5bAGv5rlZmvedwb0HSLNKBv3VIB/32iogbJL0cOC8itq20QDOzYfhsmdYGIuLqiLiIdA/QGwAi4s6K6zIza8vh3lr9dZwbf0Xp3R0ze1Fzs0wLkp4nndMu0gWunqr1AlaLiJWrqs3MrB2Hu5lZgdwsY2ZWIIe7mVmBHO5mZgVyuJuZFcjhbmZWoP8PJEXKlKlbccAAAAAASUVORK5CYII=\n"
          },
          "metadata": {
            "needs_background": "light"
          }
        }
      ],
      "source": [
        "train_df.y.value_counts().plot.bar(title=f'Equal Label Distribution in Train Dataset, total = {train_df.shape[0]}')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "VovTU4YQ1ExM",
        "pycharm": {
          "name": "#%% md\n"
        }
      },
      "source": [
        "### Train a new News Classifier from the base dataset\n",
        "\n",
        "We will call this the vanilla model and vanilla dataset, since did not modify it yet"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# Create a unfitted classifier and set some training paramters\n",
        "unfitted_classifier = nlu.load('train.classifier')\n",
        "unfitted_classifier['trainable_classifier_dl'].setMaxEpochs(10)\n",
        "unfitted_classifier['trainable_classifier_dl'].setLr(0.0005)\n",
        "unfitted_classifier['trainable_classifier_dl'].setBatchSize(64)\n",
        "unfitted_classifier['trainable_classifier_dl'].setDropout(0.5)\n",
        "unfitted_classifier.print_info()"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "-JvkRNKyrki7",
        "outputId": "638c99d9-be40-4b1d-b4bd-5caed99cf475",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "sent_small_bert_L2_128 download started this may take some time.\n",
            "Approximate size to download 16.1 MB\n",
            "[OK!]\n",
            "The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :\n",
            ">>> component_list['trainable_classifier_dl'] has settable params:\n",
            "component_list['trainable_classifier_dl'].setMaxEpochs(10)     | Info: Maximum number of epochs to train | Currently set to : 10\n",
            "component_list['trainable_classifier_dl'].setLr(0.0005)        | Info: Learning Rate | Currently set to : 0.0005\n",
            "component_list['trainable_classifier_dl'].setBatchSize(64)     | Info: Batch size | Currently set to : 64\n",
            "component_list['trainable_classifier_dl'].setDropout(0.5)      | Info: Dropout coefficient | Currently set to : 0.5\n",
            "component_list['trainable_classifier_dl'].setEnableOutputLogs(True)  | Info: Whether to use stdout in addition to Spark logs. | Currently set to : True\n",
            ">>> component_list['bert_sentence_embeddings@sent_small_bert_L2_128'] has settable params:\n",
            "component_list['bert_sentence_embeddings@sent_small_bert_L2_128'].setBatchSize(8)  | Info: Size of every batch | Currently set to : 8\n",
            "component_list['bert_sentence_embeddings@sent_small_bert_L2_128'].setIsLong(False)  | Info: Use Long type instead of Int type for inputs buffer - Some Bert models require Long instead of Int. | Currently set to : False\n",
            "component_list['bert_sentence_embeddings@sent_small_bert_L2_128'].setMaxSentenceLength(128)  | Info: Max sentence length to process | Currently set to : 128\n",
            "component_list['bert_sentence_embeddings@sent_small_bert_L2_128'].setDimension(128)  | Info: Number of embedding dimensions | Currently set to : 128\n",
            "component_list['bert_sentence_embeddings@sent_small_bert_L2_128'].setCaseSensitive(False)  | Info: whether to ignore case in tokens for embeddings matching | Currently set to : False\n",
            "component_list['bert_sentence_embeddings@sent_small_bert_L2_128'].setStorageRef('sent_small_bert_L2_128')  | Info: unique reference name for identification | Currently set to : sent_small_bert_L2_128\n",
            ">>> component_list['document_assembler'] has settable params:\n",
            "component_list['document_assembler'].setCleanupMode('shrink')  | Info: possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full | Currently set to : shrink\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "91BLAcu30bjZ",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [],
      "source": [
        "# load a trainable pipeline by specifying the train. prefix  and fit it on a datset with label and text columns\n",
        "# Since there are no\n",
        "fitted_vanilla_classifier = unfitted_classifier.fit(train_df)"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Training Dataset Metrics"
      ],
      "metadata": {
        "id": "qhScpwo_Lp4b",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# predict with the trained pipeline on dataset and get predictions\n",
        "train_preds = fitted_vanilla_classifier.predict(train_df)\n",
        "train_preds[['classifier_dl','y','text']].head(5)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 261
        },
        "id": "tq4pT1STJUZx",
        "outputId": "86aace9b-a522-42ec-900e-14d55ba8f2ee",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "sentence_detector_dl download started this may take some time.\n",
            "Approximate size to download 354.6 KB\n",
            "[OK!]\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "  classifier_dl         y                                               text\n",
              "0      Business  Business  Vodafone said today it remained keen on purcha...\n",
              "1      Business  Business  The Senate is expected to vote on the overall ...\n",
              "2      Business  Business  term suitor Foodland (FOA) yesterday with an a...\n",
              "3        Sports  Business  WASHINGTON : The Federal Reserve #39;s policy ...\n",
              "3        Sports  Business  WASHINGTON : The Federal Reserve #39;s policy ..."
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-d4f7d5f6-1db5-4a19-a963-4e34799313c9\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>classifier_dl</th>\n",
              "      <th>y</th>\n",
              "      <th>text</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Business</td>\n",
              "      <td>Business</td>\n",
              "      <td>Vodafone said today it remained keen on purcha...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>Business</td>\n",
              "      <td>Business</td>\n",
              "      <td>The Senate is expected to vote on the overall ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>Business</td>\n",
              "      <td>Business</td>\n",
              "      <td>term suitor Foodland (FOA) yesterday with an a...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>Sports</td>\n",
              "      <td>Business</td>\n",
              "      <td>WASHINGTON : The Federal Reserve #39;s policy ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>Sports</td>\n",
              "      <td>Business</td>\n",
              "      <td>WASHINGTON : The Federal Reserve #39;s policy ...</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-d4f7d5f6-1db5-4a19-a963-4e34799313c9')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "        \n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "      \n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-d4f7d5f6-1db5-4a19-a963-4e34799313c9 button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-d4f7d5f6-1db5-4a19-a963-4e34799313c9');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 25
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# Evaluate all predications\n",
        "train_metrics_not_augmented = classification_report(train_preds['y'], train_preds['classifier_dl'])\n",
        "print(train_metrics_not_augmented)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "t1pyTLaVDbVe",
        "outputId": "8b4c8b2d-88e6-47e9-ba74-0dd93836cf14",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "              precision    recall  f1-score   support\n",
            "\n",
            "    Business       0.72      0.61      0.66       133\n",
            "    Sci/Tech       0.72      0.77      0.74       155\n",
            "      Sports       0.80      0.95      0.87       138\n",
            "       World       0.83      0.73      0.78       134\n",
            "\n",
            "    accuracy                           0.77       560\n",
            "   macro avg       0.77      0.76      0.76       560\n",
            "weighted avg       0.77      0.77      0.76       560\n",
            "\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Test Dataset Metrics"
      ],
      "metadata": {
        "id": "QNAoFQcZMbIH",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "DIrl5M1A0tZF",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 206
        },
        "outputId": "b5e0fdcb-919e-4635-8014-397db88341a1",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "  classifier_dl         y                                               text\n",
              "0      Business  Business  California lawyers who reached a \\$1.1 billion...\n",
              "1      Sci/Tech  Business  DreamWorks SKG, the studio that created the  q...\n",
              "2      Business  Business  Nike Inc. (NKE.N: Quote, Profile, Research) on...\n",
              "3      Business  Business  Automaker DaimlerChrysler AG said Wednesday it...\n",
              "4      Business  Business  Online holiday shoppers this year are making c..."
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-1ac66ada-81e7-472f-ac31-58e795d8f5fb\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>classifier_dl</th>\n",
              "      <th>y</th>\n",
              "      <th>text</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Business</td>\n",
              "      <td>Business</td>\n",
              "      <td>California lawyers who reached a \\$1.1 billion...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>Sci/Tech</td>\n",
              "      <td>Business</td>\n",
              "      <td>DreamWorks SKG, the studio that created the  q...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>Business</td>\n",
              "      <td>Business</td>\n",
              "      <td>Nike Inc. (NKE.N: Quote, Profile, Research) on...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>Business</td>\n",
              "      <td>Business</td>\n",
              "      <td>Automaker DaimlerChrysler AG said Wednesday it...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>Business</td>\n",
              "      <td>Business</td>\n",
              "      <td>Online holiday shoppers this year are making c...</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-1ac66ada-81e7-472f-ac31-58e795d8f5fb')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "        \n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "      \n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-1ac66ada-81e7-472f-ac31-58e795d8f5fb button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-1ac66ada-81e7-472f-ac31-58e795d8f5fb');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 27
        }
      ],
      "source": [
        "# Get Predictions on Test Dataset, not seen by model before \n",
        "test_preds = fitted_vanilla_classifier.predict(test_df,output_level = 'document')\n",
        "\n",
        "test_preds[['classifier_dl','y','text']].head(5)"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "\n",
        "# Evaluate all test predictions and print them with test predictions for comparision\n",
        "test_metrics_not_augmented = classification_report(test_preds['y'], test_preds['classifier_dl'])\n",
        "\n",
        "sep = '_'*50\n",
        "print(sep,'Metrics on Test Dataset of model trained on vanilla dataset',sep)\n",
        "print(test_metrics_not_augmented)\n",
        "\n",
        "print(sep,'Metrics on vanilla Train datataset of model trained on vanilla dataset',sep)\n",
        "print(train_metrics_not_augmented)\n",
        "\n"
      ],
      "metadata": {
        "id": "bgJsSBbI3xXB",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "845e8514-a472-42e2-8a6f-1d12fe680d13",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "__________________________________________________ Metrics on Test Dataset of model trained on vanilla dataset __________________________________________________\n",
            "              precision    recall  f1-score   support\n",
            "\n",
            "    Business       0.75      0.75      0.75       900\n",
            "    Sci/Tech       0.80      0.69      0.74       900\n",
            "      Sports       0.86      0.93      0.89       900\n",
            "       World       0.80      0.85      0.82       900\n",
            "\n",
            "    accuracy                           0.80      3600\n",
            "   macro avg       0.80      0.80      0.80      3600\n",
            "weighted avg       0.80      0.80      0.80      3600\n",
            "\n",
            "__________________________________________________ Metrics on vanilla Train datataset of model trained on vanilla dataset __________________________________________________\n",
            "              precision    recall  f1-score   support\n",
            "\n",
            "    Business       0.72      0.61      0.66       133\n",
            "    Sci/Tech       0.72      0.77      0.74       155\n",
            "      Sports       0.80      0.95      0.87       138\n",
            "       World       0.83      0.73      0.78       134\n",
            "\n",
            "    accuracy                           0.77       560\n",
            "   macro avg       0.77      0.76      0.76       560\n",
            "weighted avg       0.77      0.77      0.76       560\n",
            "\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Aumented Training"
      ],
      "metadata": {
        "id": "0wA14tFCL5Th",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "zPF4ipq30kXW",
        "pycharm": {
          "name": "#%% md\n"
        }
      },
      "source": [
        "### Augmented Dataset Creation\n",
        "Lets create the prompts we will feed to GPT2.      \n",
        "We just take the original label + the first 15 characters       \n",
        "Created pr ompts have the pattern `<y>:<x[:n]>`  where x is the string and y the label for it"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "U5KjRFpE7TDI",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 423
        },
        "outputId": "5f3549cc-f96b-43e5-9bc4-5e54a4be83fc",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "             y                                               text  \\\n",
              "133      World   John Kerry, Bob Kerrey. It's easy to get conf...   \n",
              "2422  Sci/Tech  Black Box Voting hopes to halt the use of Dieb...   \n",
              "2676  Sci/Tech  Half of Viagra tablets sold on the Internet ar...   \n",
              "2236     World  Walking may protect the elderly from developin...   \n",
              "2785  Sci/Tech  source contribution is the first time it has s...   \n",
              "...        ...                                                ...   \n",
              "2255  Sci/Tech    The number of children taking antidepressant...   \n",
              "3494     World  The nation's Roman Catholic bishops said Frida...   \n",
              "3825    Sports   The infield at Fenway Park was covered with a...   \n",
              "3606     World   Ballot boxes poured into counting centers Mon...   \n",
              "5     Sci/Tech  The British Department for Education and Skill...   \n",
              "\n",
              "      origin_index  text_len                            base_context  \n",
              "133            133        51     World-News-Headline: John Kerry, Bo  \n",
              "2422          2422        68  Sci/Tech-News-Headline:Black Box Votin  \n",
              "2676          2676        72  Sci/Tech-News-Headline:Half of Viagra   \n",
              "2236          2236        76     World-News-Headline:Walking may pro  \n",
              "2785          2785        80  Sci/Tech-News-Headline:source contribu  \n",
              "...            ...       ...                                     ...  \n",
              "2255          2255       387  Sci/Tech-News-Headline:  The number of  \n",
              "3494          3494       434     World-News-Headline:The nation's Ro  \n",
              "3825          3825       435    Sports-News-Headline: The infield at  \n",
              "3606          3606       530     World-News-Headline: Ballot boxes p  \n",
              "5                5       780  Sci/Tech-News-Headline:The British Dep  \n",
              "\n",
              "[400 rows x 5 columns]"
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-0851dc7c-c6d3-4e09-8263-56c5e9b3ef8e\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>y</th>\n",
              "      <th>text</th>\n",
              "      <th>origin_index</th>\n",
              "      <th>text_len</th>\n",
              "      <th>base_context</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>133</th>\n",
              "      <td>World</td>\n",
              "      <td>John Kerry, Bob Kerrey. It's easy to get conf...</td>\n",
              "      <td>133</td>\n",
              "      <td>51</td>\n",
              "      <td>World-News-Headline: John Kerry, Bo</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2422</th>\n",
              "      <td>Sci/Tech</td>\n",
              "      <td>Black Box Voting hopes to halt the use of Dieb...</td>\n",
              "      <td>2422</td>\n",
              "      <td>68</td>\n",
              "      <td>Sci/Tech-News-Headline:Black Box Votin</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2676</th>\n",
              "      <td>Sci/Tech</td>\n",
              "      <td>Half of Viagra tablets sold on the Internet ar...</td>\n",
              "      <td>2676</td>\n",
              "      <td>72</td>\n",
              "      <td>Sci/Tech-News-Headline:Half of Viagra</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2236</th>\n",
              "      <td>World</td>\n",
              "      <td>Walking may protect the elderly from developin...</td>\n",
              "      <td>2236</td>\n",
              "      <td>76</td>\n",
              "      <td>World-News-Headline:Walking may pro</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2785</th>\n",
              "      <td>Sci/Tech</td>\n",
              "      <td>source contribution is the first time it has s...</td>\n",
              "      <td>2785</td>\n",
              "      <td>80</td>\n",
              "      <td>Sci/Tech-News-Headline:source contribu</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>...</th>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2255</th>\n",
              "      <td>Sci/Tech</td>\n",
              "      <td>The number of children taking antidepressant...</td>\n",
              "      <td>2255</td>\n",
              "      <td>387</td>\n",
              "      <td>Sci/Tech-News-Headline:  The number of</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3494</th>\n",
              "      <td>World</td>\n",
              "      <td>The nation's Roman Catholic bishops said Frida...</td>\n",
              "      <td>3494</td>\n",
              "      <td>434</td>\n",
              "      <td>World-News-Headline:The nation's Ro</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3825</th>\n",
              "      <td>Sports</td>\n",
              "      <td>The infield at Fenway Park was covered with a...</td>\n",
              "      <td>3825</td>\n",
              "      <td>435</td>\n",
              "      <td>Sports-News-Headline: The infield at</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3606</th>\n",
              "      <td>World</td>\n",
              "      <td>Ballot boxes poured into counting centers Mon...</td>\n",
              "      <td>3606</td>\n",
              "      <td>530</td>\n",
              "      <td>World-News-Headline: Ballot boxes p</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>5</th>\n",
              "      <td>Sci/Tech</td>\n",
              "      <td>The British Department for Education and Skill...</td>\n",
              "      <td>5</td>\n",
              "      <td>780</td>\n",
              "      <td>Sci/Tech-News-Headline:The British Dep</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "<p>400 rows × 5 columns</p>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-0851dc7c-c6d3-4e09-8263-56c5e9b3ef8e')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "        \n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "      \n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-0851dc7c-c6d3-4e09-8263-56c5e9b3ef8e button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-0851dc7c-c6d3-4e09-8263-56c5e9b3ef8e');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 38
        }
      ],
      "source": [
        "slice_len = 15 \n",
        "prompt_label_prefix='-News-Headline:'\n",
        "\n",
        "train_df['text_len'] = train_df.text.str.len()\n",
        "train_df = train_df.sort_values('text_len')\n",
        "train_df['base_context'] =  train_df['y']+ prompt_label_prefix + train_df['text'].str.slice(0,slice_len)\n",
        "train_df"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "K9q9369Y2tke",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "6dbb3b33-45f0-40b1-8973-ed0656c4246a",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :\n",
            ">>> component_list['gpt2'] has settable params:\n",
            "component_list['gpt2'].setBatchSize(4)                         | Info: Size of every batch | Currently set to : 4\n",
            "component_list['gpt2'].setIgnoreTokenIds([])                   | Info: A list of token ids which are ignored in the decoder's output | Currently set to : []\n",
            "component_list['gpt2'].setRepetitionPenalty(1.0)               | Info: The parameter for repetition penalty. 1.0 means no penalty. See `this paper <https://arxiv.org/pdf/1909.05858.pdf>`__ for more details | Currently set to : 1.0\n",
            "component_list['gpt2'].setTask('')                             | Info: Transformer's task, e.g. 'is it true that'> | Currently set to : \n",
            "component_list['gpt2'].setTemperature(1.0)                     | Info: The value used to module the next token probabilities | Currently set to : 1.0\n",
            "component_list['gpt2'].setTopP(0.8)                            | Info: If set to float < 1, only the most probable tokens with probabilities that add up to ``top_p`` or higher are kept for generation | Currently set to : 0.8\n",
            "component_list['gpt2'].setMinOutputLength(10)                  | Info: Minimum length of the sequence to be generated | Currently set to : 10\n",
            "component_list['gpt2'].setMaxOutputLength(30)                  | Info: Maximum length of output text | Currently set to : 30\n",
            "component_list['gpt2'].setDoSample(True)                       | Info: Whether or not to use sampling; use greedy decoding otherwise | Currently set to : True\n",
            "component_list['gpt2'].setTopK(50)                             | Info: The number of highest probability vocabulary tokens to keep for top-k-filtering | Currently set to : 50\n",
            "component_list['gpt2'].setNoRepeatNgramSize(3)                 | Info: If set to int > 0, all ngrams of that size can only occur once | Currently set to : 3\n",
            ">>> component_list['document_assembler'] has settable params:\n",
            "component_list['document_assembler'].setCleanupMode('shrink')  | Info: possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full | Currently set to : shrink\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "DocumentAssembler_de4915f4985b"
            ]
          },
          "metadata": {},
          "execution_count": 39
        }
      ],
      "source": [
        "gpt2_pipe.print_info()\n",
        "gpt2_pipe['gpt2'].setDoSample(True)\n",
        "gpt2_pipe['gpt2'].setMaxOutputLength(50)\n",
        "gpt2_pipe['gpt2'].setTopP(1) \n",
        "gpt2_pipe['document_assembler'].setCleanupMode('shrink')"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "E79GS_cj4Tzm",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 1000
        },
        "outputId": "f2ec3d19-5bfe-496f-a5ca-afa03b43bae6",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "__________________________________________________ Generating for label=World __________________________________________________\n",
            "__________________________________________________ Generating new training data example 0/10 for prompt : <World-News-Headline: John Kerry, Bo> __________________________________________________\n",
            "Generated : \n",
            "  World-News-Headline: John Kerry, Boers On Syria, Clinton To Talk With American Foreign Policy Team. Newsweek, 7 December 2011. [English]\n",
            "\n",
            "Bozick, B. K. (2003 Dec, Vol. 38\n",
            "__________________________________________________ Generating new training data example 1/10 for prompt : <World-News-Headline:Walking may pro> __________________________________________________\n",
            "Generated : \n",
            "  World-News-Headline:Walking may propped up the body of a man who had apparently taken his own life after walking with a cane at a West Asheville high school Monday to begin his morning walk.\n",
            "\n",
            "In the video, which appears\n",
            "__________________________________________________ Generating new training data example 2/10 for prompt : <World-News-Headline:An industrial c> __________________________________________________\n",
            "Generated : \n",
            "  World-News-Headline:An industrial coterie under fire:\n",
            "\n",
            "One of the most respected industrial plants in Britain has been hit by an incident involving a truck being driven from the plant. The incident happened on Tuesday at the plant near the\n",
            "__________________________________________________ Generating new training data example 3/10 for prompt : <World-News-Headline:Richard Faulds > __________________________________________________\n",
            "Generated : \n",
            "  World-News-Headline:Richard Faulds will have his first call upon his job to the world's attention. He will have a huge amount of weight to lose on his face, and some muscle to gain on his legs. He's going\n",
            "__________________________________________________ Generating new training data example 4/10 for prompt : <World-News-Headline:The radioactive> __________________________________________________\n",
            "Generated : \n",
            "  World-News-Headline:The radioactive material of the Fukushima Daiichi nuclear plant in Tokyo, Japan. Image: Yuriko Nakajima, International Center for Studies at the University of Tokyo. In the background, a team working on the project\n",
            "__________________________________________________ Generating new training data example 5/10 for prompt : <World-News-Headline:Greek weightlif> __________________________________________________\n",
            "Generated : \n",
            "  World-News-Headline:Greek weightlifters found a way of lifting with the goal of helping the Greek population to gain weight\n",
            "\n",
            "Sylvanas Koson/AFP/Getty Images The three Olympians were all from Greece, and their\n",
            "__________________________________________________ Generating new training data example 6/10 for prompt : <World-News-Headline:Three Indian tr> __________________________________________________\n",
            "Generated : \n",
            "  World-News-Headline:Three Indian trimmers are in hospital after falling down a mountain on Everest\n",
            "\n",
            "The Indian National Tourist Board says the two men were climbing on the Himalayan mountain and reached their destination on September 15 with no injuries.\n",
            "__________________________________________________ Generating new training data example 7/10 for prompt : <World-News-Headline:Islamic group #> __________________________________________________\n",
            "Generated : \n",
            "  World-News-Headline:Islamic group #Turkey is 'leading the fight against terrorism,' 'Kurdish forces help militants carry out raid on ISIS camps.' Turkish soldiers protect civilians during raids in Istanbul on Saturday, 15 June 2017. Photo: Michael\n",
            "__________________________________________________ Generating new training data example 8/10 for prompt : <World-News-Headline:On the fifth an> __________________________________________________\n",
            "Generated : \n",
            "  World-News-Headline:On the fifth an-agram has replaced the word \"crown\". (Image Credit: Getty Images)\n",
            "\n",
            "The English cosset's royal eagle hangs above the English crows nest. An English cower bird\n",
            "__________________________________________________ Generating new training data example 9/10 for prompt : <World-News-Headline:Turkey's foreig> __________________________________________________\n",
            "Generated : \n",
            "  World-News-Headline:Turkey's foreigms are on the rise, amid growing anxiety in Turkey around its nuclear deal with Iran\n",
            "\n",
            "\"The number of new EU citizens in 2014 was estimated at 5,900, more than a quarter of\n",
            "__________________________________________________ Generating for label=Sci/Tech __________________________________________________\n",
            "__________________________________________________ Generating new training data example 0/10 for prompt : <Sci/Tech-News-Headline:Black Box Votin> __________________________________________________\n",
            "Generated : \n",
            "  Sci/Tech-News-Headline:Black Box Votinium Sutra. http://bit.ly/2s6JmWp.\n",
            "\n",
            "This story was first published at http://www.citizen-net.org/2011\n",
            "__________________________________________________ Generating new training data example 1/10 for prompt : <Sci/Tech-News-Headline:Half of Viagra > __________________________________________________\n",
            "Generated : \n",
            "  Sci/Tech-News-Headline:Half of Viagra Will Make You Harder to Sleep\n",
            "\n",
            "Advertisement\n",
            "\n",
            "\n",
            "With all of this in mind, let's take a look at a few of his favorite ways to work on the pill.\n",
            "\n",
            "__________________________________________________ Generating new training data example 2/10 for prompt : <Sci/Tech-News-Headline:source contribu> __________________________________________________\n",
            "Generated : \n",
            "  Sci/Tech-News-Headline:source contribuited: 0,\n",
            "\n",
            "http://www.wsj.com/story/news/technology/2017/06/17/dodgy-computer-in-cricket\n",
            "__________________________________________________ Generating new training data example 3/10 for prompt : <Sci/Tech-News-Headline:Upgraded versio> __________________________________________________\n",
            "Generated : \n",
            "  Sci/Tech-News-Headline:Upgraded versio-turks)\n",
            "\n",
            "(3):http://federal.gov/news/201506146083375053:8-foot-thick-printing:\n",
            "\n",
            "__________________________________________________ Generating new training data example 4/10 for prompt : <Sci/Tech-News-Headline:The latest Blac> __________________________________________________\n",
            "Generated : \n",
            "  Sci/Tech-News-Headline:The latest Blacqos news at a glance\n",
            "\n",
            "More BlacQos photos: Bozo is best known for being a TV host/presenter for the New York Times (where he hosted,\n",
            "__________________________________________________ Generating new training data example 5/10 for prompt : <Sci/Tech-News-Headline:computing confe> __________________________________________________\n",
            "Generated : \n",
            "  Sci/Tech-News-Headline:computing confe/content/8 (accessed 22 December 2012).\n",
            "\n",
            "23. http://www.hqpress-magazine.com/2011/12/06/top-100-\n",
            "__________________________________________________ Generating new training data example 6/10 for prompt : <Sci/Tech-News-Headline:&lt;strong&gt;I> __________________________________________________\n",
            "Generated : \n",
            "  Sci/Tech-News-Headline:&lt;strong&gt;I\n",
            "\n",
            "The new study found that an increasing proportion of US households who now own and consume energy through solar panels did so on average, compared with a recent decrease in this\n",
            "__________________________________________________ Generating new training data example 7/10 for prompt : <Sci/Tech-News-Headline:Some of the net> __________________________________________________\n",
            "Generated : \n",
            "  Sci/Tech-News-Headline:Some of the nethers used in these experiments will be tested in more extreme settings, such as in a desert environment. NASA / JPL / XJPL-Caltech/Space Science Institute NASA scientists are\n",
            "__________________________________________________ Generating new training data example 8/10 for prompt : <Sci/Tech-News-Headline: IBM is rolling> __________________________________________________\n",
            "Generated : \n",
            "  Sci/Tech-News-Headline: IBM is rolling out a major cloud computing platform to its customers with a first.\n",
            "\n",
            "According to IBM, it has launched its own version of the popular WebApp, for Android, which was created by Google\n",
            "__________________________________________________ Generating new training data example 9/10 for prompt : <Sci/Tech-News-Headline:Judges send cas> __________________________________________________\n",
            "Generated : \n",
            "  Sci/Tech-News-Headline:Judges send casters a petition to the U.S. Department of Justice: http://www.ccl.gov/news-releases/?x=1549&id=17772300\n",
            "__________________________________________________ Generating for label=Business __________________________________________________\n",
            "__________________________________________________ Generating new training data example 0/10 for prompt : <Business-News-Headline: In the great r> __________________________________________________\n",
            "Generated : \n",
            "  Business-News-Headline: In the great riven cities, there's no longer any kind of traffic lane or road for cyclists.\n",
            "\n",
            "When I met I had never ridden in a vehicle before, so it was very interesting to learn a new\n",
            "__________________________________________________ Generating new training data example 1/10 for prompt : <Business-News-Headline:The IRS is gunn> __________________________________________________\n",
            "Generated : \n",
            "  Business-News-Headline:The IRS is gunnable as long as it doesn't interfere with your tax-collection activity.\n",
            "\n",
            "The IRS won't interfere. The IRS can \"pro-actively\" seek permission for you to file tax returns\n",
            "__________________________________________________ Generating new training data example 2/10 for prompt : <Business-News-Headline:The price of oi> __________________________________________________\n",
            "Generated : \n",
            "  Business-News-Headline:The price of oi-ti-toy-cage on the Internet was $75 at the time, while for most consumers there was no such thing as a decent $10 of premium, and you had to\n",
            "__________________________________________________ Generating new training data example 3/10 for prompt : <Business-News-Headline:Members of Cali> __________________________________________________\n",
            "Generated : \n",
            "  Business-News-Headline:Members of Cali-Police, who were shot and killed near N.S., are trying to explain their acts and they may have some justification.On Thursday, a Cali resident, in his 20s and wearing\n",
            "__________________________________________________ Generating new training data example 4/10 for prompt : <Business-News-Headline:Oil prices hove> __________________________________________________\n",
            "Generated : \n",
            "  Business-News-Headline:Oil prices hove into the upper end, with the United States rising 16.12 percent, while Japan's economy is forecast to remain at its lowest level of growth in a decade. The country has a debt burden of\n",
            "__________________________________________________ Generating new training data example 5/10 for prompt : <Business-News-Headline: Employers step> __________________________________________________\n",
            "Generated : \n",
            "  Business-News-Headline: Employers step up 'special interest group' targeting job ad program\n",
            "\n",
            "The US Department of Labor announced late Wednesday it is expanding a government-funded group that has a mandate in recent weeks to lobby a federal agency on\n",
            "__________________________________________________ Generating new training data example 6/10 for prompt : <Business-News-Headline:NOW heres somet> __________________________________________________\n",
            "Generated : \n",
            "  Business-News-Headline:NOW heres somethin' to take you by surprise. (MORRY RAY: Yes)\n",
            "\n",
            "And this is all part of an article on Drudge's blog. Now, that's not to say\n",
            "__________________________________________________ Generating new training data example 7/10 for prompt : <Business-News-Headline:Abbey National > __________________________________________________\n",
            "Generated : \n",
            "  Business-News-Headline:Abbey National Forests National Forest, The Northwest Passage, Oregon State Parks, Pawnee National Recreation Area National Park, Cascade Mountains National Forester, Glacier National Park\n",
            "\n",
            "What is This Forest?\n",
            "\n",
            "Located\n",
            "__________________________________________________ Generating new training data example 8/10 for prompt : <Business-News-Headline:Although the th> __________________________________________________\n",
            "Generated : \n",
            "  Business-News-Headline:Although the thirteenth year after its inception, the World Trade Center bombing was never officially officially acknowledged as an attack by the United States. The International Civil Aviation Organization, a member of the Pentagon's Special Operations Service,\n",
            "__________________________________________________ Generating new training data example 9/10 for prompt : <Business-News-Headline:US Airways Grou> __________________________________________________\n",
            "Generated : \n",
            "  Business-News-Headline:US Airways Groupe France, Boeing, Boeing of America, Airbus Group, Airbus Technology Inc, Asmara, Asperette, Aspect Management, BIMB & Sons, Bentley Americas, BHNL\n",
            "__________________________________________________ Generating for label=Sports __________________________________________________\n",
            "__________________________________________________ Generating new training data example 0/10 for prompt : <Sports-News-Headline: Reigning Major> __________________________________________________\n",
            "Generated : \n",
            "  Sports-News-Headline: Reigning Major Soccer's Oldest Supporters Group\n",
            "\n",
            "Fans at the Orlando Pride's home-town stadium will pay tribute to an Orlando Pride star who inspired generations to keep their country running through the 21st Century.\n",
            "\n",
            "__________________________________________________ Generating new training data example 1/10 for prompt : <Sports-News-Headline:Michael Schumac> __________________________________________________\n",
            "Generated : \n",
            "  Sports-News-Headline:Michael Schumac\n",
            "\n",
            "\"He didn't have to talk the whole game.\" — Ben Howland\n",
            "\n",
            "For Schumacs, the final year of his contract made him one of the most important coaches ever to\n",
            "__________________________________________________ Generating new training data example 2/10 for prompt : <Sports-News-Headline:day layoff fail> __________________________________________________\n",
            "Generated : \n",
            "  Sports-News-Headline:day layoff fail by Giants\n",
            "\n",
            "* Giants hit only three home runs and failed to score in the opening run of their second game\n",
            "\n",
            "A lot of people say that Giants reliever Zack Greinke would need seven\n",
            "__________________________________________________ Generating new training data example 3/10 for prompt : <Sports-News-Headline: Chad Johnson b> __________________________________________________\n",
            "Generated : \n",
            "  Sports-News-Headline: Chad Johnson banged in double for team in double-block in overtime, but then missed another chance to get off the bench. 10:53\n",
            "\n",
            "BARNES HOSPITAL: The Jacksonville Wild Wings have clin\n",
            "__________________________________________________ Generating new training data example 4/10 for prompt : <Sports-News-Headline:age record of 1> __________________________________________________\n",
            "Generated : \n",
            "  Sports-News-Headline:age record of 1st time, only 4-2 by NBA\n",
            "\n",
            "It's just one more opportunity in the league for Cleveland - and for their star players who made it to the end of their respective seasons.\n",
            "\n",
            "__________________________________________________ Generating new training data example 5/10 for prompt : <Sports-News-Headline: The Boston Red> __________________________________________________\n",
            "Generated : \n",
            "  Sports-News-Headline: The Boston Red Sox (11-14, 7-6) sweep Boston's wild-card team at Fenway Park and begin their playoff showdown with the Boston Bruins on Wednesday night at 7 p.m. (ET\n",
            "__________________________________________________ Generating new training data example 6/10 for prompt : <Sports-News-Headline:Tony Dickens re> __________________________________________________\n",
            "Generated : \n",
            "  Sports-News-Headline:Tony Dickens re-imagined as Peter Rabbit in an unrivaled fantasy '90s cartoon, and the world still loves it.\n",
            "\n",
            "• What's the difference between your '70s '90 shows and today\n",
            "__________________________________________________ Generating new training data example 7/10 for prompt : <Sports-News-Headline:The Browns star> __________________________________________________\n",
            "Generated : \n",
            "  Sports-News-Headline:The Browns star has not yet been traded, though he was reported back in the weekend of Sunday's practice. Cleveland said the news could not be independently confirmed Tuesday morning.\n",
            "\n",
            "Athletic center Mike Thomas said\n",
            "__________________________________________________ Generating new training data example 8/10 for prompt : <Sports-News-Headline:THENS, Aug. 18 > __________________________________________________\n",
            "Generated : \n",
            "  Sports-News-Headline:THENS, Aug. 18, 2012, 04:03:08 PM »\n",
            "\n",
            "I will use the new information for any news reporting that you have of me or any news about either person, that is to say\n",
            "__________________________________________________ Generating new training data example 9/10 for prompt : <Sports-News-Headline: A roaring crow> __________________________________________________\n",
            "Generated : \n",
            "  Sports-News-Headline: A roaring crowbar has knocked out an 11-game winning streak with Boston.\n",
            "\n",
            "After Saturday's Game 1 loss to Los Angeles, the Celtics' first 10-game win streak snapped a seven-game skid\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "                                 document  \\\n",
              "0     World-News-Headline: John Kerry, Bo   \n",
              "0     World-News-Headline:Walking may pro   \n",
              "0     World-News-Headline:An industrial c   \n",
              "0      World-News-Headline:Richard Faulds   \n",
              "0     World-News-Headline:The radioactive   \n",
              "0     World-News-Headline:Greek weightlif   \n",
              "0     World-News-Headline:Three Indian tr   \n",
              "0     World-News-Headline:Islamic group #   \n",
              "0     World-News-Headline:On the fifth an   \n",
              "0     World-News-Headline:Turkey's foreig   \n",
              "0  Sci/Tech-News-Headline:Black Box Votin   \n",
              "0   Sci/Tech-News-Headline:Half of Viagra   \n",
              "0  Sci/Tech-News-Headline:source contribu   \n",
              "0  Sci/Tech-News-Headline:Upgraded versio   \n",
              "0  Sci/Tech-News-Headline:The latest Blac   \n",
              "0  Sci/Tech-News-Headline:computing confe   \n",
              "0  Sci/Tech-News-Headline:&lt;strong&gt;I   \n",
              "0  Sci/Tech-News-Headline:Some of the net   \n",
              "0  Sci/Tech-News-Headline: IBM is rolling   \n",
              "0  Sci/Tech-News-Headline:Judges send cas   \n",
              "0  Business-News-Headline: In the great r   \n",
              "0  Business-News-Headline:The IRS is gunn   \n",
              "0  Business-News-Headline:The price of oi   \n",
              "0  Business-News-Headline:Members of Cali   \n",
              "0  Business-News-Headline:Oil prices hove   \n",
              "0  Business-News-Headline: Employers step   \n",
              "0  Business-News-Headline:NOW heres somet   \n",
              "0   Business-News-Headline:Abbey National   \n",
              "0  Business-News-Headline:Although the th   \n",
              "0  Business-News-Headline:US Airways Grou   \n",
              "0    Sports-News-Headline: Reigning Major   \n",
              "0    Sports-News-Headline:Michael Schumac   \n",
              "0    Sports-News-Headline:day layoff fail   \n",
              "0    Sports-News-Headline: Chad Johnson b   \n",
              "0    Sports-News-Headline:age record of 1   \n",
              "0    Sports-News-Headline: The Boston Red   \n",
              "0    Sports-News-Headline:Tony Dickens re   \n",
              "0    Sports-News-Headline:The Browns star   \n",
              "0     Sports-News-Headline:THENS, Aug. 18   \n",
              "0    Sports-News-Headline: A roaring crow   \n",
              "\n",
              "                                           generated  \n",
              "0   World-News-Headline: John Kerry, Boers On Syr...  \n",
              "0   World-News-Headline:Walking may propped up th...  \n",
              "0   World-News-Headline:An industrial coterie und...  \n",
              "0   World-News-Headline:Richard Faulds will have ...  \n",
              "0   World-News-Headline:The radioactive material ...  \n",
              "0   World-News-Headline:Greek weightlifters found...  \n",
              "0   World-News-Headline:Three Indian trimmers are...  \n",
              "0   World-News-Headline:Islamic group #Turkey is ...  \n",
              "0   World-News-Headline:On the fifth an-agram has...  \n",
              "0   World-News-Headline:Turkey's foreigms are on ...  \n",
              "0   Sci/Tech-News-Headline:Black Box Votinium Sut...  \n",
              "0   Sci/Tech-News-Headline:Half of Viagra Will Ma...  \n",
              "0   Sci/Tech-News-Headline:source contribuited: 0...  \n",
              "0   Sci/Tech-News-Headline:Upgraded versio-turks)...  \n",
              "0   Sci/Tech-News-Headline:The latest Blacqos new...  \n",
              "0   Sci/Tech-News-Headline:computing confe/conten...  \n",
              "0   Sci/Tech-News-Headline:&lt;strong&gt;I\\n\\nThe...  \n",
              "0   Sci/Tech-News-Headline:Some of the nethers us...  \n",
              "0   Sci/Tech-News-Headline: IBM is rolling out a ...  \n",
              "0   Sci/Tech-News-Headline:Judges send casters a ...  \n",
              "0   Business-News-Headline: In the great riven ci...  \n",
              "0   Business-News-Headline:The IRS is gunnable as...  \n",
              "0   Business-News-Headline:The price of oi-ti-toy...  \n",
              "0   Business-News-Headline:Members of Cali-Police...  \n",
              "0   Business-News-Headline:Oil prices hove into t...  \n",
              "0   Business-News-Headline: Employers step up 'sp...  \n",
              "0   Business-News-Headline:NOW heres somethin' to...  \n",
              "0   Business-News-Headline:Abbey National Forests...  \n",
              "0   Business-News-Headline:Although the thirteent...  \n",
              "0   Business-News-Headline:US Airways Groupe Fran...  \n",
              "0   Sports-News-Headline: Reigning Major Soccer's...  \n",
              "0   Sports-News-Headline:Michael Schumac\\n\\n\"He d...  \n",
              "0   Sports-News-Headline:day layoff fail by Giant...  \n",
              "0   Sports-News-Headline: Chad Johnson banged in ...  \n",
              "0   Sports-News-Headline:age record of 1st time, ...  \n",
              "0   Sports-News-Headline: The Boston Red Sox (11-...  \n",
              "0   Sports-News-Headline:Tony Dickens re-imagined...  \n",
              "0   Sports-News-Headline:The Browns star has not ...  \n",
              "0   Sports-News-Headline:THENS, Aug. 18, 2012, 04...  \n",
              "0   Sports-News-Headline: A roaring crowbar has k...  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-78fb855c-4ed7-4b10-9a13-7aebb8e5e0c9\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>document</th>\n",
              "      <th>generated</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>World-News-Headline: John Kerry, Bo</td>\n",
              "      <td>World-News-Headline: John Kerry, Boers On Syr...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>World-News-Headline:Walking may pro</td>\n",
              "      <td>World-News-Headline:Walking may propped up th...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>World-News-Headline:An industrial c</td>\n",
              "      <td>World-News-Headline:An industrial coterie und...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>World-News-Headline:Richard Faulds</td>\n",
              "      <td>World-News-Headline:Richard Faulds will have ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>World-News-Headline:The radioactive</td>\n",
              "      <td>World-News-Headline:The radioactive material ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>World-News-Headline:Greek weightlif</td>\n",
              "      <td>World-News-Headline:Greek weightlifters found...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>World-News-Headline:Three Indian tr</td>\n",
              "      <td>World-News-Headline:Three Indian trimmers are...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>World-News-Headline:Islamic group #</td>\n",
              "      <td>World-News-Headline:Islamic group #Turkey is ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>World-News-Headline:On the fifth an</td>\n",
              "      <td>World-News-Headline:On the fifth an-agram has...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>World-News-Headline:Turkey's foreig</td>\n",
              "      <td>World-News-Headline:Turkey's foreigms are on ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Sci/Tech-News-Headline:Black Box Votin</td>\n",
              "      <td>Sci/Tech-News-Headline:Black Box Votinium Sut...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Sci/Tech-News-Headline:Half of Viagra</td>\n",
              "      <td>Sci/Tech-News-Headline:Half of Viagra Will Ma...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Sci/Tech-News-Headline:source contribu</td>\n",
              "      <td>Sci/Tech-News-Headline:source contribuited: 0...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Sci/Tech-News-Headline:Upgraded versio</td>\n",
              "      <td>Sci/Tech-News-Headline:Upgraded versio-turks)...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Sci/Tech-News-Headline:The latest Blac</td>\n",
              "      <td>Sci/Tech-News-Headline:The latest Blacqos new...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Sci/Tech-News-Headline:computing confe</td>\n",
              "      <td>Sci/Tech-News-Headline:computing confe/conten...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Sci/Tech-News-Headline:&amp;lt;strong&amp;gt;I</td>\n",
              "      <td>Sci/Tech-News-Headline:&amp;lt;strong&amp;gt;I\\n\\nThe...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Sci/Tech-News-Headline:Some of the net</td>\n",
              "      <td>Sci/Tech-News-Headline:Some of the nethers us...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Sci/Tech-News-Headline: IBM is rolling</td>\n",
              "      <td>Sci/Tech-News-Headline: IBM is rolling out a ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Sci/Tech-News-Headline:Judges send cas</td>\n",
              "      <td>Sci/Tech-News-Headline:Judges send casters a ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Business-News-Headline: In the great r</td>\n",
              "      <td>Business-News-Headline: In the great riven ci...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Business-News-Headline:The IRS is gunn</td>\n",
              "      <td>Business-News-Headline:The IRS is gunnable as...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Business-News-Headline:The price of oi</td>\n",
              "      <td>Business-News-Headline:The price of oi-ti-toy...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Business-News-Headline:Members of Cali</td>\n",
              "      <td>Business-News-Headline:Members of Cali-Police...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Business-News-Headline:Oil prices hove</td>\n",
              "      <td>Business-News-Headline:Oil prices hove into t...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Business-News-Headline: Employers step</td>\n",
              "      <td>Business-News-Headline: Employers step up 'sp...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Business-News-Headline:NOW heres somet</td>\n",
              "      <td>Business-News-Headline:NOW heres somethin' to...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Business-News-Headline:Abbey National</td>\n",
              "      <td>Business-News-Headline:Abbey National Forests...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Business-News-Headline:Although the th</td>\n",
              "      <td>Business-News-Headline:Although the thirteent...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Business-News-Headline:US Airways Grou</td>\n",
              "      <td>Business-News-Headline:US Airways Groupe Fran...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Sports-News-Headline: Reigning Major</td>\n",
              "      <td>Sports-News-Headline: Reigning Major Soccer's...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Sports-News-Headline:Michael Schumac</td>\n",
              "      <td>Sports-News-Headline:Michael Schumac\\n\\n\"He d...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Sports-News-Headline:day layoff fail</td>\n",
              "      <td>Sports-News-Headline:day layoff fail by Giant...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Sports-News-Headline: Chad Johnson b</td>\n",
              "      <td>Sports-News-Headline: Chad Johnson banged in ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Sports-News-Headline:age record of 1</td>\n",
              "      <td>Sports-News-Headline:age record of 1st time, ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Sports-News-Headline: The Boston Red</td>\n",
              "      <td>Sports-News-Headline: The Boston Red Sox (11-...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Sports-News-Headline:Tony Dickens re</td>\n",
              "      <td>Sports-News-Headline:Tony Dickens re-imagined...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Sports-News-Headline:The Browns star</td>\n",
              "      <td>Sports-News-Headline:The Browns star has not ...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Sports-News-Headline:THENS, Aug. 18</td>\n",
              "      <td>Sports-News-Headline:THENS, Aug. 18, 2012, 04...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Sports-News-Headline: A roaring crow</td>\n",
              "      <td>Sports-News-Headline: A roaring crowbar has k...</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-78fb855c-4ed7-4b10-9a13-7aebb8e5e0c9')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "        \n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "      \n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-78fb855c-4ed7-4b10-9a13-7aebb8e5e0c9 button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-78fb855c-4ed7-4b10-9a13-7aebb8e5e0c9');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 40
        }
      ],
      "source": [
        "# Generate a 100 samples, takes around 20 min on Colab with GPU \n",
        "# We take 25 per label, in total 100\n",
        "num_aug_per_label = 10\n",
        "aug_data = []\n",
        "\n",
        "# FYI we are kinda generating fake news here lol please dont use this for naughty things\n",
        "for label in train_df.y.unique():\n",
        "  print(sep,f'Generating for label={label}',sep)\n",
        "  prompt_df = train_df[train_df.y==label]\n",
        "  for n in range(num_aug_per_label):\n",
        "    print(sep,f'Generating new training data example {n}/{num_aug_per_label} for prompt : <{prompt_df.iloc[n].base_context}>',sep)\n",
        "    text = prompt_df.iloc[n].base_context\n",
        "    generation = gpt2_pipe.predict(text) \n",
        "    aug_data.append(generation)\n",
        "    print('Generated : \\n',generation['generated'].iloc[0])\n",
        "aug_df = pd.concat(aug_data)\n",
        "aug_df"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "hAb8aoq4BSdx",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 206
        },
        "outputId": "516ff1f4-dce5-491d-b2b2-da40b01a0abe",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "       y                                               text\n",
              "0  World  John Kerry, Bo Mortensen, Donald Rumsfeld\\n\\nL...\n",
              "0  World  Walking may procreate, but Human Life has a La...\n",
              "0  World  An industrial cauldron of uncertainty rises in...\n",
              "0  World  Richard Faulds argues that all political corre...\n",
              "0  World  The radioactive contamination of the soil, sea..."
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-fad4c8d8-2061-4f97-8dac-49d8c7157a11\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>y</th>\n",
              "      <th>text</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>World</td>\n",
              "      <td>John Kerry, Bo Mortensen, Donald Rumsfeld\\n\\nL...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>World</td>\n",
              "      <td>Walking may procreate, but Human Life has a La...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>World</td>\n",
              "      <td>An industrial cauldron of uncertainty rises in...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>World</td>\n",
              "      <td>Richard Faulds argues that all political corre...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>World</td>\n",
              "      <td>The radioactive contamination of the soil, sea...</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-fad4c8d8-2061-4f97-8dac-49d8c7157a11')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "        \n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "      \n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-fad4c8d8-2061-4f97-8dac-49d8c7157a11 button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-fad4c8d8-2061-4f97-8dac-49d8c7157a11');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 34
        }
      ],
      "source": [
        "# Lets remove the <label> part from the generated text, so we are not leaking the label\n",
        "aug_df = pd.concat(aug_data)\n",
        "aug_df['y'] = aug_df.document.apply(lambda x : x.split('-News-Headline:')[0])\n",
        "aug_df['text'] = aug_df.generated.apply(lambda x : x.split('-News-Headline:')[1])\n",
        "aug_df = aug_df[['y','text']]\n",
        "aug_df.head(5)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "0ApkR8S34wv_",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 460
        },
        "outputId": "d63fd322-5a1f-41fc-feef-b0d7f8f38280",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Augmented Dataset Size :  440\n",
            "Vanilla Dataset size :  400\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "             y                                               text  \\\n",
              "133      World   John Kerry, Bob Kerrey. It's easy to get conf...   \n",
              "2422  Sci/Tech  Black Box Voting hopes to halt the use of Dieb...   \n",
              "2676  Sci/Tech  Half of Viagra tablets sold on the Internet ar...   \n",
              "2236     World  Walking may protect the elderly from developin...   \n",
              "2785  Sci/Tech  source contribution is the first time it has s...   \n",
              "...        ...                                                ...   \n",
              "0       Sports  The Boston Red Sox made a remarkable comeback ...   \n",
              "0       Sports  The Browns star turned right tackle who has ma...   \n",
              "0       Sports  Tony Dickens reels off NSW Test debut\\n\\nEwan ...   \n",
              "0       Sports  THENS, Aug. 18 (UPI) -- As athletes and sports...   \n",
              "0       Sports  A roaring crow dives in on a brutal crash on H...   \n",
              "\n",
              "      origin_index  text_len                    base_context  \n",
              "133          133.0      51.0     World-News:  John Kerry, Bo  \n",
              "2422        2422.0      68.0  Sci/Tech-News: Black Box Votin  \n",
              "2676        2676.0      72.0  Sci/Tech-News: Half of Viagra   \n",
              "2236        2236.0      76.0     World-News: Walking may pro  \n",
              "2785        2785.0      80.0  Sci/Tech-News: source contribu  \n",
              "...            ...       ...                             ...  \n",
              "0              NaN       NaN                             NaN  \n",
              "0              NaN       NaN                             NaN  \n",
              "0              NaN       NaN                             NaN  \n",
              "0              NaN       NaN                             NaN  \n",
              "0              NaN       NaN                             NaN  \n",
              "\n",
              "[440 rows x 5 columns]"
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-1a2b5b81-3f33-4895-ae50-e5371af9ac41\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>y</th>\n",
              "      <th>text</th>\n",
              "      <th>origin_index</th>\n",
              "      <th>text_len</th>\n",
              "      <th>base_context</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>133</th>\n",
              "      <td>World</td>\n",
              "      <td>John Kerry, Bob Kerrey. It's easy to get conf...</td>\n",
              "      <td>133.0</td>\n",
              "      <td>51.0</td>\n",
              "      <td>World-News:  John Kerry, Bo</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2422</th>\n",
              "      <td>Sci/Tech</td>\n",
              "      <td>Black Box Voting hopes to halt the use of Dieb...</td>\n",
              "      <td>2422.0</td>\n",
              "      <td>68.0</td>\n",
              "      <td>Sci/Tech-News: Black Box Votin</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2676</th>\n",
              "      <td>Sci/Tech</td>\n",
              "      <td>Half of Viagra tablets sold on the Internet ar...</td>\n",
              "      <td>2676.0</td>\n",
              "      <td>72.0</td>\n",
              "      <td>Sci/Tech-News: Half of Viagra</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2236</th>\n",
              "      <td>World</td>\n",
              "      <td>Walking may protect the elderly from developin...</td>\n",
              "      <td>2236.0</td>\n",
              "      <td>76.0</td>\n",
              "      <td>World-News: Walking may pro</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2785</th>\n",
              "      <td>Sci/Tech</td>\n",
              "      <td>source contribution is the first time it has s...</td>\n",
              "      <td>2785.0</td>\n",
              "      <td>80.0</td>\n",
              "      <td>Sci/Tech-News: source contribu</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>...</th>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Sports</td>\n",
              "      <td>The Boston Red Sox made a remarkable comeback ...</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Sports</td>\n",
              "      <td>The Browns star turned right tackle who has ma...</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Sports</td>\n",
              "      <td>Tony Dickens reels off NSW Test debut\\n\\nEwan ...</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Sports</td>\n",
              "      <td>THENS, Aug. 18 (UPI) -- As athletes and sports...</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Sports</td>\n",
              "      <td>A roaring crow dives in on a brutal crash on H...</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "<p>440 rows × 5 columns</p>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-1a2b5b81-3f33-4895-ae50-e5371af9ac41')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "        \n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "      \n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-1a2b5b81-3f33-4895-ae50-e5371af9ac41 button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-1a2b5b81-3f33-4895-ae50-e5371af9ac41');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 35
        }
      ],
      "source": [
        "augmented_train_df = train_df.append(aug_df)\n",
        "print(\"Augmented Dataset Size : \", augmented_train_df.shape[0])\n",
        "print(\"Vanilla Dataset size : \", train_df.shape[0])\n",
        "\n",
        "augmented_train_df"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "15vGj-vrB4WD",
        "pycharm": {
          "name": "#%% md\n"
        }
      },
      "source": [
        "###  Train a new Model with Augmented Data\n",
        "\n",
        "We ewill call this the augmented model"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# Create a unfitted classifier and set some training paramters\n",
        "unfitted_classifier = nlu.load('train.classifier')\n",
        "unfitted_classifier['trainable_classifier_dl'].setMaxEpochs(10)\n",
        "unfitted_classifier['trainable_classifier_dl'].setLr(0.0005)\n",
        "unfitted_classifier['trainable_classifier_dl'].setBatchSize(64)\n",
        "unfitted_classifier['trainable_classifier_dl'].setDropout(0.5)\n",
        "unfitted_classifier.print_info()"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "pDbfVPYksFJe",
        "outputId": "0f47f349-d66e-4f1f-c141-b3e470bf885a",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "sent_small_bert_L2_128 download started this may take some time.\n",
            "Approximate size to download 16.1 MB\n",
            "[OK!]\n",
            "The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :\n",
            ">>> component_list['trainable_classifier_dl'] has settable params:\n",
            "component_list['trainable_classifier_dl'].setMaxEpochs(10)     | Info: Maximum number of epochs to train | Currently set to : 10\n",
            "component_list['trainable_classifier_dl'].setLr(0.0005)        | Info: Learning Rate | Currently set to : 0.0005\n",
            "component_list['trainable_classifier_dl'].setBatchSize(64)     | Info: Batch size | Currently set to : 64\n",
            "component_list['trainable_classifier_dl'].setDropout(0.5)      | Info: Dropout coefficient | Currently set to : 0.5\n",
            "component_list['trainable_classifier_dl'].setEnableOutputLogs(True)  | Info: Whether to use stdout in addition to Spark logs. | Currently set to : True\n",
            ">>> component_list['bert_sentence_embeddings@sent_small_bert_L2_128'] has settable params:\n",
            "component_list['bert_sentence_embeddings@sent_small_bert_L2_128'].setBatchSize(8)  | Info: Size of every batch | Currently set to : 8\n",
            "component_list['bert_sentence_embeddings@sent_small_bert_L2_128'].setIsLong(False)  | Info: Use Long type instead of Int type for inputs buffer - Some Bert models require Long instead of Int. | Currently set to : False\n",
            "component_list['bert_sentence_embeddings@sent_small_bert_L2_128'].setMaxSentenceLength(128)  | Info: Max sentence length to process | Currently set to : 128\n",
            "component_list['bert_sentence_embeddings@sent_small_bert_L2_128'].setDimension(128)  | Info: Number of embedding dimensions | Currently set to : 128\n",
            "component_list['bert_sentence_embeddings@sent_small_bert_L2_128'].setCaseSensitive(False)  | Info: whether to ignore case in tokens for embeddings matching | Currently set to : False\n",
            "component_list['bert_sentence_embeddings@sent_small_bert_L2_128'].setStorageRef('sent_small_bert_L2_128')  | Info: unique reference name for identification | Currently set to : sent_small_bert_L2_128\n",
            ">>> component_list['document_assembler'] has settable params:\n",
            "component_list['document_assembler'].setCleanupMode('shrink')  | Info: possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full | Currently set to : shrink\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "wczNogmmBMkY",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 206
        },
        "outputId": "81e26276-0b67-4df7-904e-be4c7817aa8b",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "  classifier_dl         y                                               text\n",
              "0        Sports     World   John Kerry, Bob Kerrey. It's easy to get conf...\n",
              "1      Sci/Tech  Sci/Tech  Black Box Voting hopes to halt the use of Dieb...\n",
              "2      Sci/Tech  Sci/Tech  Half of Viagra tablets sold on the Internet ar...\n",
              "3      Sci/Tech     World  Walking may protect the elderly from developin...\n",
              "4      Sci/Tech  Sci/Tech  source contribution is the first time it has s..."
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-b0718ee8-1662-4859-ae2f-2133c9ab49a0\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>classifier_dl</th>\n",
              "      <th>y</th>\n",
              "      <th>text</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Sports</td>\n",
              "      <td>World</td>\n",
              "      <td>John Kerry, Bob Kerrey. It's easy to get conf...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>Sci/Tech</td>\n",
              "      <td>Sci/Tech</td>\n",
              "      <td>Black Box Voting hopes to halt the use of Dieb...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>Sci/Tech</td>\n",
              "      <td>Sci/Tech</td>\n",
              "      <td>Half of Viagra tablets sold on the Internet ar...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>Sci/Tech</td>\n",
              "      <td>World</td>\n",
              "      <td>Walking may protect the elderly from developin...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>Sci/Tech</td>\n",
              "      <td>Sci/Tech</td>\n",
              "      <td>source contribution is the first time it has s...</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-b0718ee8-1662-4859-ae2f-2133c9ab49a0')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "        \n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "      \n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-b0718ee8-1662-4859-ae2f-2133c9ab49a0 button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-b0718ee8-1662-4859-ae2f-2133c9ab49a0');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 42
        }
      ],
      "source": [
        "# load a trainable pipeline by specifying the <train.> prefix and fit it on a datset with column named <y> and <text> columns\n",
        "augmented_fitted_classifier = unfitted_classifier.fit(augmented_train_df)\n",
        "\n",
        "# predict with the trainable pipeline on dataset and get predictions\n",
        "# aug_train_preds = augmented_fitted_classifier.predict(augmented_train_df,output_level = 'document')\n",
        "aug_train_preds = augmented_fitted_classifier.predict(train_df,output_level = 'document')\n",
        "# Train Predictions\n",
        "aug_train_preds[['classifier_dl','y','text']].head(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "eAG0960lCbc2",
        "pycharm": {
          "name": "#%% md\n"
        }
      },
      "source": [
        " ### Evaluate Train Metrics for Augmented Model"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# Get Metrics on Train vanilla train dataset with augmented model and compare with metrics of vanilla model on vanilla datset\n",
        "train_metrics_augmented = classification_report(aug_train_preds['y'], aug_train_preds['classifier_dl'])\n",
        "\n",
        "sep = '_'*50\n",
        "\n",
        "print(sep, 'Metrics on Train dataset with  Vanilla Model',sep)\n",
        "print(train_metrics_not_augmented)\n",
        "\n",
        "print(sep,'Metrics on Train dataset with Augmented Model',sep)\n",
        "print(train_metrics_augmented)\n",
        "\n",
        "\n",
        "\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "OLmZw-iwW8r6",
        "outputId": "b5ecd937-98da-49f7-aa0a-19985397ac77",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "__________________________________________________ Metrics on Train dataset with  Vanilla Model __________________________________________________\n",
            "              precision    recall  f1-score   support\n",
            "\n",
            "    Business       0.72      0.61      0.66       133\n",
            "    Sci/Tech       0.72      0.77      0.74       155\n",
            "      Sports       0.80      0.95      0.87       138\n",
            "       World       0.83      0.73      0.78       134\n",
            "\n",
            "    accuracy                           0.77       560\n",
            "   macro avg       0.77      0.76      0.76       560\n",
            "weighted avg       0.77      0.77      0.76       560\n",
            "\n",
            "__________________________________________________ Metrics on Train dataset with Augmented Model __________________________________________________\n",
            "              precision    recall  f1-score   support\n",
            "\n",
            "    Business       0.89      0.54      0.67       100\n",
            "    Sci/Tech       0.70      0.81      0.75       100\n",
            "      Sports       0.87      0.94      0.90       100\n",
            "       World       0.75      0.86      0.80       100\n",
            "\n",
            "    accuracy                           0.79       400\n",
            "   macro avg       0.80      0.79      0.78       400\n",
            "weighted avg       0.80      0.79      0.78       400\n",
            "\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "dkMkOz40Cg1O",
        "pycharm": {
          "name": "#%% md\n"
        }
      },
      "source": [
        "\n",
        "### Evaluate Test Metrics for Augmented Model"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "qESdglTwCEW4",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 423
        },
        "outputId": "d7db73f4-3edb-428e-fb09-db5be95369f7",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "     classifier_dl         y  \\\n",
              "0         Business  Business   \n",
              "1         Sci/Tech  Business   \n",
              "2         Business  Business   \n",
              "3         Business  Business   \n",
              "4         Business  Business   \n",
              "...            ...       ...   \n",
              "3595         World     World   \n",
              "3596         World     World   \n",
              "3597         World     World   \n",
              "3598      Business     World   \n",
              "3599      Business     World   \n",
              "\n",
              "                                                   text  \n",
              "0     California lawyers who reached a \\$1.1 billion...  \n",
              "1     DreamWorks SKG, the studio that created the  q...  \n",
              "2     Nike Inc. (NKE.N: Quote, Profile, Research) on...  \n",
              "3     Automaker DaimlerChrysler AG said Wednesday it...  \n",
              "4     Online holiday shoppers this year are making c...  \n",
              "...                                                 ...  \n",
              "3595   KINSHASA, Congo (AP)   Attackers overran a sl...  \n",
              "3596   Millions of French students returned to schoo...  \n",
              "3597   Tammy Hough is a life long Republican, a soci...  \n",
              "3598  Many of Johnny Cash's possessions were sold at...  \n",
              "3599  Give the guy some credit. Tung Chee-hwa, Hong ...  \n",
              "\n",
              "[3600 rows x 3 columns]"
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-06d4add8-6762-4b4a-8bc0-fcd1dc1a2888\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>classifier_dl</th>\n",
              "      <th>y</th>\n",
              "      <th>text</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Business</td>\n",
              "      <td>Business</td>\n",
              "      <td>California lawyers who reached a \\$1.1 billion...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>Sci/Tech</td>\n",
              "      <td>Business</td>\n",
              "      <td>DreamWorks SKG, the studio that created the  q...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>Business</td>\n",
              "      <td>Business</td>\n",
              "      <td>Nike Inc. (NKE.N: Quote, Profile, Research) on...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>Business</td>\n",
              "      <td>Business</td>\n",
              "      <td>Automaker DaimlerChrysler AG said Wednesday it...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>Business</td>\n",
              "      <td>Business</td>\n",
              "      <td>Online holiday shoppers this year are making c...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>...</th>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3595</th>\n",
              "      <td>World</td>\n",
              "      <td>World</td>\n",
              "      <td>KINSHASA, Congo (AP)   Attackers overran a sl...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3596</th>\n",
              "      <td>World</td>\n",
              "      <td>World</td>\n",
              "      <td>Millions of French students returned to schoo...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3597</th>\n",
              "      <td>World</td>\n",
              "      <td>World</td>\n",
              "      <td>Tammy Hough is a life long Republican, a soci...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3598</th>\n",
              "      <td>Business</td>\n",
              "      <td>World</td>\n",
              "      <td>Many of Johnny Cash's possessions were sold at...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3599</th>\n",
              "      <td>Business</td>\n",
              "      <td>World</td>\n",
              "      <td>Give the guy some credit. Tung Chee-hwa, Hong ...</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "<p>3600 rows × 3 columns</p>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-06d4add8-6762-4b4a-8bc0-fcd1dc1a2888')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "        \n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "      \n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-06d4add8-6762-4b4a-8bc0-fcd1dc1a2888 button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-06d4add8-6762-4b4a-8bc0-fcd1dc1a2888');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 44
        }
      ],
      "source": [
        "# Lets predict on original test_df with the augmented pipe and see if our accuracy improved   \n",
        "augmented_test_preds = augmented_fitted_classifier.predict(test_df,output_level = 'document')\n",
        "test_augmented_evaluation = classification_report(augmented_test_preds['y'], augmented_test_preds['classifier_dl'])\n",
        "# Test Prediction\n",
        "augmented_test_preds[['classifier_dl','y','text']]"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "sep = '_'*50\n",
        "print(sep,'Metrics on Test dataset, for model which was trained on Vanilla Dataset:',sep)\n",
        "print(test_augmented_evaluation)\n",
        "print(sep, 'Metrics on Test dataset, for model which was trained on Augmented Dataset:',sep)\n",
        "print(test_metrics_not_augmented)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "XQtA29yCXwbe",
        "outputId": "df6389f0-5280-49c0-de6b-51987b153cae",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "__________________________________________________ Metrics on Test dataset, for model which was trained on Vanilla Dataset: __________________________________________________\n",
            "              precision    recall  f1-score   support\n",
            "\n",
            "    Business       0.84      0.43      0.57       900\n",
            "    Sci/Tech       0.68      0.78      0.73       900\n",
            "      Sports       0.89      0.91      0.90       900\n",
            "       World       0.67      0.89      0.77       900\n",
            "\n",
            "    accuracy                           0.75      3600\n",
            "   macro avg       0.77      0.75      0.74      3600\n",
            "weighted avg       0.77      0.75      0.74      3600\n",
            "\n",
            "__________________________________________________ Metrics on Test dataset, for model which was trained on Augmented Dataset: __________________________________________________\n",
            "              precision    recall  f1-score   support\n",
            "\n",
            "    Business       0.75      0.75      0.75       900\n",
            "    Sci/Tech       0.80      0.69      0.74       900\n",
            "      Sports       0.86      0.93      0.89       900\n",
            "       World       0.80      0.85      0.82       900\n",
            "\n",
            "    accuracy                           0.80      3600\n",
            "   macro avg       0.80      0.80      0.80      3600\n",
            "weighted avg       0.80      0.80      0.80      3600\n",
            "\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Let's create some generic functions and test data augmentation on other datasets"
      ],
      "metadata": {
        "id": "xNU4rY8JX5pC",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Dataset Creation Function"
      ],
      "metadata": {
        "id": "x18aWCyH0Y81",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "## full Configuration for a Data Augmentation Experiment\n",
        "# Play with these values and see how they influence accuracy\n",
        "import nlu\n",
        "# For Print Logs\n",
        "sep = '_'*50\n",
        "\n",
        "\n",
        "# Base Train Dataset Settings\n",
        "n_per_class_total = 300 \n",
        "train_test_frac=0.9\n",
        "label_col ='y'\n",
        "\n",
        "# Train Settings\n",
        "embeddings_to_use=''\n",
        "epochs=5\n",
        "learn_rate=0.00005\n",
        "batch_size=32\n",
        "droput=0.5\n",
        "\n",
        "\n",
        "# Text Generation Settings\n",
        "prompt_label_prefix='-News-Headline:'\n",
        "slice_len=15\n",
        "num_aug_per_label=25\n",
        "gpt2_pipe = nlu.load('gpt2')\n",
        "gpt2_pipe['gpt2'].setDoSample(True)\n",
        "gpt2_pipe['gpt2'].setMaxOutputLength(30)\n",
        "gpt2_pipe['document_assembler'].setCleanupMode('shrink')\n",
        "\n",
        "learn_rate=0.0005"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "mzKwHgywp0On",
        "outputId": "bd4a41c0-a302-4ce5-ae89-b28988c9c90d",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "gpt2 download started this may take some time.\n",
            "Approximate size to download 442.7 MB\n",
            "[OK!]\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "import pandas as pd \n",
        "from sklearn.model_selection import train_test_split\n",
        "from sklearn.metrics import classification_report\n",
        "from sklearn.model_selection import train_test_split\n",
        "def get_base_train_test_dataset(base_dataset:pd.DataFrame,\n",
        "                                n_per_class_total = 1000 ,\n",
        "                                train_test_frac=0.8,\n",
        "                                label_col='y'\n",
        "                                ):\n",
        "  \"\"\"Create dataset where labels in train/test are eqeually distributed.\n",
        "  \n",
        "  \"\"\"\n",
        "  base_dataset = base_dataset.groupby(label_col).head(n_per_class_total)\n",
        "  train_dfs = []\n",
        "  test_dfs = []\n",
        "  for label in base_dataset[label_col].unique():\n",
        "    train_df, test_df = train_test_split(base_dataset[base_dataset[label_col] == label],test_size = train_test_frac)\n",
        "    test_dfs.append(test_df)\n",
        "    train_dfs.append(train_df)\n",
        "\n",
        "\n",
        "  test_df = pd.concat(test_dfs)\n",
        "  train_df = pd.concat(train_dfs)\n",
        "\n",
        "  return test_df, train_df\n",
        "\n"
      ],
      "metadata": {
        "id": "ysejn1itX5Fq",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Fit and Evaluate Function"
      ],
      "metadata": {
        "id": "7c3gosdo0cpl",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def fit_and_evaluate(untrained_model, train_df, test_df, label_col ='y',\n",
        "                     epochs=5, learn_rate=0.005, batch_size=64, droput=0.5\n",
        "                     ):\n",
        "  \"\"\" \n",
        "  1. Fit Model train data\n",
        "  2. Gets train metrics\n",
        "  3. Gets test metrics\n",
        "  \"\"\"\n",
        "  train_df = train_df.rename(columns ={label_col:'y'})\n",
        "  print(f'Training Settings : Epochs={epochs}, learn_rate={learn_rate}, batch_size={batch_size}, dropout={droput}')\n",
        "\n",
        "  untrained_model['trainable_classifier_dl'].setMaxEpochs(epochs)\n",
        "  untrained_model['trainable_classifier_dl'].setLr(learn_rate)\n",
        "  untrained_model['trainable_classifier_dl'].setBatchSize(batch_size)\n",
        "  untrained_model['trainable_classifier_dl'].setDropout(droput)\n",
        "  fitted_classifier = untrained_model.fit(train_df)\n",
        "  train_preds = fitted_classifier.predict(train_df)\n",
        "  train_preds['y'] = train_preds['y'].astype(str)\n",
        "  train_preds['classifier_dl'] = train_preds['classifier_dl'].astype(str)\n",
        "\n",
        "  test_metrics = classification_report(train_preds['y'], train_preds['classifier_dl'])\n",
        "  test_preds = fitted_classifier.predict(test_df,output_level = 'document')\n",
        "  test_preds[['classifier_dl','y','text']].head(5)\n",
        "  train_metrics = classification_report(test_preds['y'], test_preds['classifier_dl'])\n",
        "  return fitted_classifier, test_metrics, train_metrics\n"
      ],
      "metadata": {
        "id": "5DrukyvtgVtQ",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Create Augmented Dataset Function"
      ],
      "metadata": {
        "id": "RNZL-g9m0iMB",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def check_enough_data_for_generation(train_df,num_aug_per_label):\n",
        "  # Before creating dataset, we check if there are enough instances to generate frorm\n",
        "  for label in train_df.y.unique():\n",
        "    num_samples_for_label = len(train_df[train_df.y==label])\n",
        "    if num_samples_for_label <  num_aug_per_label : \n",
        "      raise ValueError(f\"\"\"Your Dataset has to few examples for generation.\n",
        "      For label={label} only {num_samples_for_label} exampples exist in the training dataset but {num_aug_per_label} should be generated.\n",
        "      Please fix this by increasing dataset size or train/test split or n_per_class_total parameters\"\"\")\n",
        "\n",
        "\n",
        "\n",
        "\n",
        "def creat_augmented_dataset(generator_model, train_df, slice_len=15, num_aug_per_label=10,\n",
        "                            label_col = 'y',\n",
        "                            prompt_label_prefix = '-News:'\n",
        "                            ):\n",
        "  \"\"\"Create Augmented Dataset with a generator Model using simple Label prefixing and data slicing\"\"\"\n",
        "  train_df = train_df.rename(columns ={label_col:'y'})\n",
        "  train_df['text_len'] = train_df.text.str.len()\n",
        "  train_df = train_df.sort_values('text_len')\n",
        "  train_df['base_context'] =  train_df['y']+ prompt_label_prefix + train_df['text'].str.slice(0,slice_len)\n",
        "\n",
        "  aug_data = []\n",
        "  num_labels = len(train_df.y.unique())\n",
        "\n",
        "  # Create dataset\n",
        "  for i, label in enumerate(train_df.y.unique()):\n",
        "    print(sep,f'Generating for label={label} No. {i+1}/{num_labels}',sep)\n",
        "    prompt_df = train_df[train_df.y==label]\n",
        "    for n in range(num_aug_per_label):\n",
        "      print(sep,f'Generating new training data example {n+1}/{num_aug_per_label} for prompt : <{prompt_df.iloc[n].base_context}>',sep)\n",
        "      text = prompt_df.iloc[n].base_context\n",
        "      generation = generator_model.predict(text) \n",
        "      aug_data.append(generation)\n",
        "      print('Generated : \\n',generation['generated'].iloc[0])\n",
        "  # Remove the <label> part from the generated text, so we are not leaking the label\n",
        "  aug_df = pd.concat(aug_data)\n",
        "  aug_df['y'] = aug_df.document.apply(lambda x : x.split(prompt_label_prefix)[0])\n",
        "  aug_df['text'] = aug_df.generated.apply(lambda x : x.split(prompt_label_prefix)[1])\n",
        "  aug_df = aug_df[['y','text']]\n",
        "  augmented_train_df = train_df.append(aug_df)\n",
        "  return augmented_train_df"
      ],
      "metadata": {
        "id": "1LMT9JFIhkd0",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": []
    },
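    {
      "cell_type": "markdown",
      "source": [
        "The label prefixing and slicing in `creat_augmented_dataset` can be tried in isolation. The following is a minimal sketch on made-up rows, assuming only pandas: `toy_df`, its two example texts, and the fake `generated` column are hypothetical stand-ins for `train_df` and the real generator output, so no model is loaded.\n",
        "\n",
        "```python\n",
        "import pandas as pd\n",
        "\n",
        "# Toy stand-in for train_df; the rows are made up for illustration\n",
        "toy_df = pd.DataFrame({\n",
        "    'y': ['Sports', 'Business'],\n",
        "    'text': ['Paul Hamm takes gold in the all-around final',\n",
        "             'Oil prices climb as supply worries grow'],\n",
        "})\n",
        "prefix = '-News:'\n",
        "slice_len = 15\n",
        "\n",
        "# Same prompt construction as in creat_augmented_dataset: label + prefix + first characters of the text\n",
        "toy_df['base_context'] = toy_df['y'] + prefix + toy_df['text'].str.slice(0, slice_len)\n",
        "\n",
        "# Hypothetical generator output: the prompt followed by a made-up continuation\n",
        "toy_df['generated'] = toy_df['base_context'] + ' ... (model continuation)'\n",
        "\n",
        "# Splitting on the prefix recovers the label and the label-free text\n",
        "toy_df['label_from_prompt'] = toy_df['base_context'].apply(lambda x: x.split(prefix)[0])\n",
        "toy_df['text_without_label'] = toy_df['generated'].apply(lambda x: x.split(prefix)[1])\n",
        "print(toy_df[['base_context', 'label_from_prompt', 'text_without_label']])\n",
        "```\n",
        "\n",
        "Note that `split(prefix)[1]` keeps only the text between the first and second occurrence of the prefix, so if a continuation happens to repeat the prefix, the rest is cut off. `split(prefix, 1)[1]` would keep everything after the first prefix instead."
      ],
      "metadata": {}
    },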
    {
      "cell_type": "markdown",
      "source": [
        "### Compare Vanilla with Augmentec Training Function"
      ],
      "metadata": {
        "id": "_qrI29Eq0l1P",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def compare_vanilla_and_augmented_training(\n",
        "  base_dataset:pd.DataFrame,\n",
        "  generator_model,\n",
        "  embeddings_to_use='',\n",
        "  n_per_class_total = 1000 ,\n",
        "  train_test_frac=0.8,\n",
        "  label_col ='y',\n",
        "  prompt_label_prefix = '-News:',\n",
        "  slice_len=15,\n",
        "  num_aug_per_label=10,\n",
        "  epochs=5, learn_rate=0.005, batch_size=64, droput=0.5\n",
        "  ):\n",
        "  \"\"\"Trains and Compares : \n",
        "  1. Vanilla Classifier Train&Test Metrics\n",
        "  2. Augmented Classifier Train&Test Metrics\n",
        "  Only for Sequence Classification Problems, using Spark NLP's ClassifierDL\n",
        "  \"\"\"\n",
        "\n",
        "  print(sep,'Starting Data Augmentation Experiment',sep)\n",
        "  test_df, train_df = get_base_train_test_dataset(\n",
        "      base_dataset=base_dataset,\n",
        "      n_per_class_total=n_per_class_total,\n",
        "      train_test_frac=train_test_frac,\n",
        "      label_col=label_col,\n",
        "      )\n",
        "  print(f'Train Dataset Size = {train_df.shape[0]}')\n",
        "  print(f'Test Dataset Size = {test_df.shape[0]}')\n",
        "  # Before creating dataset, we check if there are enough instances to generate frorm\n",
        "  check_enough_data_for_generation(train_df, num_aug_per_label)\n",
        "\n",
        "\n",
        "  print('Training Vanilla Classifier')\n",
        "  no_aug_fitted_classifier, no_aug_test_metrics, no_aug_train_metrics = fit_and_evaluate(\n",
        "      untrained_model=nlu.load(f'train.classifier{embeddings_to_use}'),\n",
        "      train_df=train_df,\n",
        "      test_df=test_df,\n",
        "      label_col=label_col,\n",
        "      epochs=epochs, learn_rate=learn_rate, batch_size=batch_size, droput=droput)\n",
        "  print(sep,'Metrics on Train datataset with Vanilla Model',sep)\n",
        "  print(no_aug_train_metrics)\n",
        "  print(sep,'Metrics on Test Dataset with Vanilla Model',sep)\n",
        "  print(no_aug_test_metrics)\n",
        "  print(sep*2)\n",
        "\n",
        "\n",
        "  # Load Generator, configure and Create Augmented Dataset \n",
        "  #Note: You could simply replace the creat_augmented_dataset() method here with your own method here and evaluate other augmentation techniques\n",
        "  augmented_train_df = creat_augmented_dataset(\n",
        "      generator_model=generator_model,\n",
        "      train_df=train_df,\n",
        "      slice_len=slice_len,\n",
        "      num_aug_per_label=num_aug_per_label,\n",
        "      label_col=label_col,\n",
        "      prompt_label_prefix=prompt_label_prefix,)\n",
        "  print(f'Augmented Training Dataset Size = {augmented_train_df.shape[0]}')\n",
        "\n",
        "  print('Training Augmented Classifierr')\n",
        "  ## Train Augmented classifier on Augmented Dataset\n",
        "  aug_fitted_classifier, aug_test_metrics, aug_train_metrics = fit_and_evaluate(\n",
        "      untrained_model=nlu.load(f'train.classifier{embeddings_to_use}'),\n",
        "      train_df=augmented_train_df,\n",
        "      test_df=test_df,\n",
        "      label_col=label_col,\n",
        "      epochs=epochs, learn_rate=learn_rate, batch_size=batch_size, droput=droput)\n",
        "\n",
        "  \n",
        "  \n",
        "  # Get Metrics on Vanilla Dataset with Aug Classifier\n",
        "  aug_vanilla_preds = aug_fitted_classifier.predict(train_df)\n",
        "  aug_vanilla_preds['y'] = aug_vanilla_preds['y'].astype(str)\n",
        "  aug_vanilla_preds['classifier_dl'] = aug_vanilla_preds['classifier_dl'].astype(str)\n",
        "  aug_vanilla_train_metrics = classification_report(aug_vanilla_preds['y'], aug_vanilla_preds['classifier_dl'])\n",
        "\n",
        "\n",
        "\n",
        "  print(sep,'Metrics on vanilla Train datataset with AUGMENTED Model',sep)\n",
        "  print(aug_vanilla_train_metrics)\n",
        "  print(sep*2)\n",
        "\n",
        "  print(sep,'Metrics on vanilla Train datataset with VANILLA Model',sep)\n",
        "  print(no_aug_train_metrics)\n",
        "  print(sep*2)\n",
        "\n",
        "  print(sep,'Metrics on Test datataset with AUGMENTED Model',sep)\n",
        "  print(aug_test_metrics)\n",
        "  print(sep*2)\n",
        "\n",
        "  print(sep,'Metrics on Test datataset with VANILLA Model',sep)\n",
        "  print(no_aug_test_metrics)\n",
        "  print(sep*2)\n",
        "\n",
        "  print(sep,'Metrics on Augmented Train datataset with AUGMENTED Model',sep)\n",
        "  print(aug_train_metrics)\n",
        "  print(sep*3)\n",
        "\n",
        "\n",
        "\n",
        "\n",
        "  return no_aug_fitted_classifier, aug_fitted_classifier\n",
        "\n",
        "\n"
      ],
      "metadata": {
        "id": "xzon8dJ5fqLp",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": []
    },
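    {
      "cell_type": "markdown",
      "source": [
        "The note inside `compare_vanilla_and_augmented_training` says the augmentation step can be swapped for another technique. As a rough sketch of what such a replacement might look like, the hypothetical function below keeps the same leading parameters as `creat_augmented_dataset` but augments by randomly dropping words instead of generating text. The name `word_dropout_augmented_dataset`, the `drop_prob` and `seed` parameters, and the `demo_df` rows are all assumptions made for illustration and are not part of the notebook or of NLU.\n",
        "\n",
        "```python\n",
        "import random\n",
        "import pandas as pd\n",
        "\n",
        "def word_dropout_augmented_dataset(generator_model, train_df, slice_len=15, num_aug_per_label=10,\n",
        "                                   label_col='y', prompt_label_prefix='-News:',\n",
        "                                   drop_prob=0.2, seed=0):\n",
        "    # generator_model, slice_len and prompt_label_prefix are unused here;\n",
        "    # they are accepted only to keep the call signature compatible\n",
        "    rng = random.Random(seed)\n",
        "    train_df = train_df.rename(columns={label_col: 'y'})\n",
        "    aug_rows = []\n",
        "    for label in train_df.y.unique():\n",
        "        # Use the first num_aug_per_label examples of each label as augmentation sources\n",
        "        for text in train_df[train_df.y == label].text.head(num_aug_per_label):\n",
        "            words = text.split()\n",
        "            # Drop each word with probability drop_prob; fall back to the original if all were dropped\n",
        "            kept = [w for w in words if rng.random() > drop_prob] or words\n",
        "            aug_rows.append({'y': label, 'text': ' '.join(kept)})\n",
        "    # Original rows first, augmented rows appended after them\n",
        "    return pd.concat([train_df, pd.DataFrame(aug_rows)], ignore_index=True)\n",
        "\n",
        "# Made-up rows, only to show the call\n",
        "demo_df = pd.DataFrame({'y': ['Sports', 'World'],\n",
        "                        'text': ['The home team won the final game', 'Leaders met to discuss the treaty']})\n",
        "print(word_dropout_augmented_dataset(None, demo_df, num_aug_per_label=1))\n",
        "```\n",
        "\n",
        "Keeping the signature identical means the call inside the comparison function would only need a different function name, which makes it easier to compare augmentation techniques under the same training settings."
      ],
      "metadata": {}
    },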
    {
      "cell_type": "markdown",
      "source": [
        "## Let's test out the functions on a dataset"
      ],
      "metadata": {
        "id": "ipFVkCn00rur",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "! wget https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/news_Category/news_category_test.csv\n",
        "import pandas as pd\n",
        "import pandas as pd \n",
        "import nlu\n",
        "base_df = pd.read_csv('/content/news_category_test.csv').iloc[:10000]\n",
        "base_df.columns=['y','text']\n",
        "base_df.y.value_counts().plot.barh(title='Label Distribution')\n",
        "base_df.head(5)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 673
        },
        "id": "DuRTooCZ0vn3",
        "outputId": "836e657a-993f-4d28-af42-1fbdf2f7bb89",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "--2022-09-04 11:38:32--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/news_Category/news_category_test.csv\n",
            "Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.80.6\n",
            "Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.80.6|:443... connected.\n",
            "HTTP request sent, awaiting response... 200 OK\n",
            "Length: 1504408 (1.4M) [text/csv]\n",
            "Saving to: ‘news_category_test.csv.1’\n",
            "\n",
            "news_category_test. 100%[===================>]   1.43M  2.77MB/s    in 0.5s    \n",
            "\n",
            "2022-09-04 11:38:33 (2.77 MB/s) - ‘news_category_test.csv.1’ saved [1504408/1504408]\n",
            "\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "          y                                               text\n",
              "0  Business  Unions representing workers at Turner   Newall...\n",
              "1  Sci/Tech   TORONTO, Canada    A second team of rocketeer...\n",
              "2  Sci/Tech   A company founded by a chemistry researcher a...\n",
              "3  Sci/Tech   It's barely dawn when Mike Fitzpatrick starts...\n",
              "4  Sci/Tech   Southern California's smog fighting agency we..."
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-57c7e55d-3524-4c6f-89b1-2082de76cdd7\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>y</th>\n",
              "      <th>text</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Business</td>\n",
              "      <td>Unions representing workers at Turner   Newall...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>Sci/Tech</td>\n",
              "      <td>TORONTO, Canada    A second team of rocketeer...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>Sci/Tech</td>\n",
              "      <td>A company founded by a chemistry researcher a...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>Sci/Tech</td>\n",
              "      <td>It's barely dawn when Mike Fitzpatrick starts...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>Sci/Tech</td>\n",
              "      <td>Southern California's smog fighting agency we...</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-57c7e55d-3524-4c6f-89b1-2082de76cdd7')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "        \n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "      \n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-57c7e55d-3524-4c6f-89b1-2082de76cdd7 button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-57c7e55d-3524-4c6f-89b1-2082de76cdd7');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 50
        },
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "<Figure size 432x288 with 1 Axes>"
            ],
            "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZAAAAEICAYAAABxiqLiAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAWP0lEQVR4nO3de5RlZX3m8e9jg40INGCT2GnBEkQURRHaCwwgoy5Bwetg8AqYzDBmJSbGaILjijLJMINjYlhoMgYNooiK19EBFIwzeAFRurGhIXJTm5EW5aYNCCK0v/njvBUPNVXdXW9X1anu/n7WOuvs8+693/3b+3Sfp969T+1KVSFJ0nQ9bNQFSJI2TwaIJKmLASJJ6mKASJK6GCCSpC4GiCSpiwGirUqSi5P8+7let61/aJLretefpL8vJTm+TZ+Q5Jsz2Pdrk1w0U/1py2SAaLOUZHWS54+6jnFJTk7yQJK72+P6JO9PsmR8mar6RlXts5F9fWxDy1XVC6vqIzNQ+1iSSrLNUN/nVNULNrVvbdkMEGnmnFtVOwK7Ai8HHg2sGA6RmZAB/+9q5PxHqC1Kkl2SnJfktiQ/a9OPmbDYXkm+k+SuJF9IsuvQ+s9OcmmSnye5Msnh062hqh6oqmuAY4HbgD9rfR+e5Oahbf1FkjVtxHJdkuclORL4T8CxSe5JcmVb9uIkpyS5BLgX2HOSU2ppo561Sa5N8ryhGQ8ZsU0Y5Xy9Pf+8bfOgiafEkhyc5PLW9+VJDh6ad3GSv05ySduXi5Isnu5x0+bHANGW5mHAh4HHAnsA9wHvn7DMccDvAUuAB4HTAZIsBc4H/guDUcRbgc8m2a2nkKpaB3wBOHTivCT7AH8EPKONWo4AVlfVl4H/ymA0s0NVPW1otdcDJwI7AjdNsslnAd8HFgPvAj43HI7rcVh73rlt81sTat2VwXE5HXgU8F7g/CSPGlrsNcAbgN8CHs7g2GkLZ4Boi1JVd1TVZ6vq3qq6GzgFeM6Exc6uqqur6hfAXwK/m2QB8Drggqq6oKp+XVVfAZYDL9qEkn7MIIwmWgcsBPZNsm1Vra6q72+gr7Oq6pqqerCqHphk/q3AaW0EdC5wHXDUJtQ+7ijghqo6u237E8C1wIuHlvlwVV1fVfcBnwL2n4Htap4zQLRFSbJ9kn9MclOSuxicntm5BcS4Hw1N3wRsy+Cn9scCr2ynr36e5OfAIQxGKr2WAndObKyqG4E3AycDtyb5ZJLf2UBfP9rA/DX10Luj3gRsqM+N8Tv8/yOemxjs27ifDE3fC+wwA9vVPGeAaEvzZ8A+wLOqaid+c3omQ8vsPjS9B/AAcDuDD+izq2rnoccjq+rUnkLahe4XA9+YbH5VfbyqDmEQXAW8e3zWFF1u6NbZS5MM7+ceDEZAAL8Ath+a9+hp9PvjVuOwPYA1G1hPWzgDRJuzbZNsN/TYhsH1gfsYXBDelcG1gIlel2TfJNsDfwV8pl2v+Bjw4iRHJFnQ+jx8kovw65VkmyRPAj7B4IP6vZMss0+S5yZZCPyy1fzrNvunwFjHN61+C/jjJNsmeSXwJOCCNm8l8Ko2bxlwzNB6t7Vt7zlFvxcAT0jymrZvxwL7AudNsz5tYQwQbc4uYPDBO/44GTgNeASDEcVlwJcnWe9s4CwGp122A/4YoKp+BLyUwbegbmMwInkbG///5Ngk9wBrgS8CdwAHVtWPJ1l2IXBqq/MnDD78397mfbo935Hkio3cNsC3gb1bn6cAx1TVHW3eXwJ7AT8D/jPw8fGVquretvwl7dTds4c7bX0czWB0dwfw58DRVXX7NGrTFij+QSlJUg9HIJKkLgaIJKmLASJJ6mKASJK6bLPhRbYcixcvrrGxsVGXIUmbjRUrVtxeVZPezmerCpCxsTGWL18+6jIkabORZLL7rgGewpIkdTJAJEldDBBJUhcDRJLUxQCRJHUxQC
RJXQwQSVIXA0SS1MUAkSR1MUAkSV0MEElSFwNEktRlq7qZ4qo1axk76fxRlyFJc2b1qUfNWt+OQCRJXQwQSVIXA0SS1MUAkSR1MUAkSV0MEElSFwNEktTFAJEkdTFAJEldZjVAkvxdkjcPvb4wyYeGXv9tkrdsZF9nJTlmkvbDk5w3MxVLkjbWbI9ALgEOBkjyMGAx8OSh+QcDl26okyQLZqU6SVK32Q6QS4GD2vSTgauBu5PskmQh8CRgUZLvJlmV5MzWTpLVSd6d5ArglcOdJjkyybVt3itmeR8kSZOY1QCpqh8DDybZg8Fo41vAtxmEyjLgBuBDwLFVtR+Dmzv+wVAXd1TVAVX1yfGGJNsBHwReDBwIPHo290GSNLm5uIh+KYPwGA+Qbw29vhn4YVVd35b9CHDY0LrnTtLfE9s6N1RVAR9b38aTnJhkeZLl6+5du2l7Ikn6V3MRIOPXQfZjcArrMgYjkIOBizew7i82deNVdUZVLauqZQu2X7Sp3UmSmrkagRwN3FlV66rqTmBnBiHyWWAsyePbsq8HvraB/q5t6+zVXr96FmqWJG3AXATIKgbfvrpsQtvaqroZeAPw6SSrgF8DH1hfZ1X1S+BE4Px2Ef3WWalakrRes/4XCatqHbDThLYThqa/Cjx9kvXG1rPOlxlcC5EkjYi/iS5J6mKASJK6GCCSpC4GiCSpiwEiSepigEiSuhggkqQuBogkqcus/yLhfLLf0kUsP/WoUZchSVsERyCSpC4GiCSpiwEiSepigEiSuhggkqQuBogkqYsBIknqYoBIkroYIJKkLgaIJKmLASJJ6mKASJK6GCCSpC4GiCSpiwEiSepigEiSuhggkqQuBogkqYsBIknqYoBIkroYIJKkLgaIJKmLASJJ6mKASJK6GCCSpC4GiCSpyzajLmAurVqzlrGTzh91GZI0Z1afetSs9e0IRJLUxQCRJHUxQCRJXQwQSVIXA0SS1MUAkSR1MUAkSV0MEElSFwNEktTFAJEkdZmVAEnyjiTXJLkqycokz5qBPg9PcvBM1CdJ2nQzfi+sJAcBRwMHVNX9SRYDD9/EPrcBDgfuAS7d5CIlSZtsNm6muAS4varuB6iq2wGSrAY+BbwQuA94TVXdmGQMOBNYDNwGvKGq/m+Ss4BfAk8H1gAHA+uSvA54E/Bo4F3AOmBtVR02C/siSZrCbJzCugjYPcn1Sf4hyXOG5q2tqv2A9wOntbb3AR+pqqcC5wCnDy3/GODgqnoF8AHg76pq/6r6BvBO4IiqehrwkqmKSXJikuVJlq+7d+2M7aQkbe1mPECq6h7gQOBEBiOKc5Oc0GZ/Yuj5oDZ9EPDxNn02cMhQd5+uqnVTbOoS4Kwk/wFYsJ56zqiqZVW1bMH2i6a7O5KkKczK3wNpH/oXAxcnWQUcPz5reLGN6OoX69nGG9vF+aOAFUkOrKo7OkuWJE3TjI9AkuyTZO+hpv2Bm9r0sUPP32rTlwKvatOvBb4xRdd3AzsObWevqvp2Vb2TwUhn9xkoX5K0kWZjBLID8L4kOwMPAjcyOJ11NLBLkquA+4FXt+XfBHw4ydtoF9Gn6Pd/AZ9J8tK2zp+2oArwVeDKWdgXSdIUZjxAqmoFg29MPUQSgPdU1V9MWP4m4LmT9HPChNfXA08dappqpCJJmgP+JrokqcusXESfTFWNzdW2JEmzzxGIJKmLASJJ6mKASJK6GCCSpC4GiCSpy5x9C2s+2G/pIpafetSoy5CkLYIjEElSFwNEktTFAJEkdTFAJEldDBBJUhcDRJLUxQCRJHUxQCRJXQwQSVIXA0SS1MUAkSR1MUAkSV0MEElSFwNEktTFAJEkdTFAJEldDBBJUhcDRJLUxQCRJHUxQCRJXQwQSVIXA0SS1MUAkSR1MUAkSV0MEElSFwNEktTFAJEkddlm1AXMpVVr1jJ20vmjLkOS5szqU4+atb4dgUiSuhggkqQuBogkqYsBIknqYoBIkroYIJKkLgaIJK
mLASJJ6mKASJK6dAVIknckuSbJVUlWJnnWFMstS3L60Ottk/ywrbMyyU+SrBl6/fCN3P7hSc7rqV2SNDOmfSuTJAcBRwMHVNX9SRYDk37wV9VyYPlQ0yHAeVX1ptbXycA9VfU3061DkjRaPSOQJcDtVXU/QFXdXlU/TvKMJJcmuTLJd5LsOMlI4UjgS5N1muTAJF9LsiLJhUmWtPbHJ/nn1u8VSfZqq+yQ5DNJrk1yTpJ07IskqVNPgFwE7J7k+iT/kOQ57dTTucCfVNXTgOcD902y7r8FLp7YmGRb4H3AMVV1IHAmcEqbfQ7w963fg4FbWvvTgTcD+wJ7Av9msmKTnJhkeZLl6+5d27G7kqTJTPsUVlXdk+RA4FAGgXAugw/7W6rq8rbMXQDDg4IkS4E7q+reSbrdB3gK8JW2zgLgliQ7Akur6vOt318O9fudqrq5vV4JjAHfnKTeM4AzABYu2bumu7+SpMl13c69qtYxGElcnGQV8IcbsdqRwIVTzAtwTVUd9JDGQYBM5f6h6XVsZbeml6RRm/YprCT7JNl7qGl/4HvAkiTPaMvsmGTiB/qU1z+A64Dd2gX68W9rPbmq7gZuTvKy1r4wyfbTrVmSNPN6fmrfAXhfkp2BB4EbgROBD7f2RzC4/vH88RWSLAAeX1XXTtZhVf0qyTHA6UkWtbpOA64BXg/8Y5K/Ah4AXtlRsyRphqVq9i8LJDkEeF1VvXHWN7YeC5fsXUuOP22UJUjSnNrUv0iYZEVVLZts3pxcN6iqbzLJBW5J0ubLW5lIkroYIJKkLgaIJKmLASJJ6mKASJK6GCCSpC4GiCSpy1Z1/6j9li5i+Sb+Uo0kacARiCSpiwEiSepigEiSuhggkqQuBogkqYsBIknqYoBIkroYIJKkLgaIJKmLASJJ6mKASJK6GCCSpC4GiCSpiwEiSepigEiSuhggkqQuBogkqYsBIknqYoBIkroYIJKkLgaIJKmLASJJ6mKASJK6GCCSpC4GiCSpiwEiSeqyzagLmEur1qxl7KTzR12GJM2Z1aceNWt9OwKRJHUxQCRJXQwQSVIXA0SS1MUAkSR1MUAkSV0MEElSFwNEktTFAJEkddlggCRZl2RlkiuTXJHk4J4NJXljkuN61pUkzT8bcyuT+6pqf4AkRwD/DXjOdDdUVR+Y7jqSpPlruqewdgJ+BpDk8CTnjc9I8v4kJ7TpU5P8S5KrkvxNazs5yVvb9MVJ3p3kO0muT3Joa1+Q5D1JLm/r/sfWviTJ19tI6Ookh7Zlz2qvVyX5000+GpKkjbYxI5BHJFkJbAcsAZ67voWTPAp4OfDEqqokO0+17ap6ZpIXAe8Cng/8PrC2qp6RZCFwSZKLgFcAF1bVKUkWANsD+wNLq+opbbtTbUeSNAumewrrIOCjSZ6ynuXXAr8E/qmNUM6bYrnPtecVwFibfgHw1CTHtNeLgL2By4Ezk2wL/M+qWpnkB8CeSd4HnA9cNNlGkpwInAiwYKfdNrSvkqSNNK1TWFX1LWAxsBvw4IT1t2vLPAg8E/gMcDTw5Sm6u789r+M3QRbgTVW1f3s8rqouqqqvA4cBa4CzkhxXVT8DngZcDLwR+NAUNZ9RVcuqatmC7RdNZ3clSesxrb8HkuSJwALgDuAmYN92qukRwPOAbybZAdi+qi5Icgnwg2ls4kLgD5L876p6IMkTGITGYuDmqvpg294BSS4AflVVn01yHfCx6eyLJGnTTOcaCAxGCMdX1TrgR0k+BVwN/BD4bltmR+ALSbZry79lGvV8iMHprCuSBLgNeBlwOPC2JA8A9wDHAUuBDycZHwW9fRrbkSRtolTVqGuYMwuX7F1Ljj9t1GVI0pzZ1L9ImGRFVS2bbJ6/iS5J6mKASJK6GCCSpC4GiCSpiwEiSepigEiSuhggkqQuBogkqcu0bmWyudtv6SKWb+Iv1UiSBhyBSJK6GCCSpC4GiCSpiwEiSepigEiSuhggkqQuBogkqYsBIknqYo
BIkroYIJKkLgaIJKmLASJJ6mKASJK6pKpGXcOcSXI3cN2o69gIi4HbR13EBmwONYJ1zjTrnFmbQ52PrardJpuxVd3OHbiuqpaNuogNSbJ8vte5OdQI1jnTrHNmbS51TsVTWJKkLgaIJKnL1hYgZ4y6gI20OdS5OdQI1jnTrHNmbS51TmqruoguSZo5W9sIRJI0QwwQSVKXrSJAkhyZ5LokNyY5acS17J7k/yT5lyTXJPmT1n5ykjVJVrbHi4bWeXur/bokR8xhrauTrGr1LG9tuyb5SpIb2vMurT1JTm91XpXkgDmqcZ+hY7YyyV1J3jwfjmeSM5PcmuTqobZpH78kx7flb0hy/BzV+Z4k17ZaPp9k59Y+luS+oeP6gaF1Dmz/Xm5s+5I5qHPa7/Nsfx5MUee5QzWuTrKytY/seM6IqtqiH8AC4PvAnsDDgSuBfUdYzxLggDa9I3A9sC9wMvDWSZbft9W8EHhc25cFc1TramDxhLb/DpzUpk8C3t2mXwR8CQjwbODbI3qvfwI8dj4cT+Aw4ADg6t7jB+wK/KA979Kmd5mDOl8AbNOm3z1U59jwchP6+U6rPW1fXjgHdU7rfZ6Lz4PJ6pww/2+Bd476eM7EY2sYgTwTuLGqflBVvwI+Cbx0VMVU1S1VdUWbvhv4HrB0Pau8FPhkVd1fVT8EbmSwT6PyUuAjbfojwMuG2j9aA5cBOydZMse1PQ/4flXdtJ5l5ux4VtXXgTsn2f50jt8RwFeq6s6q+hnwFeDI2a6zqi6qqgfby8uAx6yvj1brTlV1WQ0+/T7Kb/Zt1upcj6ne51n/PFhfnW0U8bvAJ9bXx1wcz5mwNQTIUuBHQ69vZv0f2HMmyRjwdODbremP2imDM8dPbTDa+gu4KMmKJCe2tt+uqlva9E+A327T8+E4v4qH/secb8cTpn/8Rl0vwO8x+Al43OOSfDfJ15Ic2tqWttrGzWWd03mfR308DwV+WlU3DLXNt+O50baGAJmXkuwAfBZ4c1XdBfwPYC9gf+AWBsPcUTukqg4AXgj8YZLDhme2n4zmxffAkzwceAnw6dY0H4/nQ8yn4zeVJO8AHgTOaU23AHtU1dOBtwAfT7LTqOpjM3ifJ3g1D/0hZ74dz2nZGgJkDbD70OvHtLaRSbItg/A4p6o+B1BVP62qdVX1a+CD/Oa0ysjqr6o17flW4POtpp+On5pqz7eOus7mhcAVVfVTmJ/Hs5nu8RtZvUlOAI4GXtvCjnZK6I42vYLB9YQntJqGT3PNSZ0d7/Moj+c2wCuAc8fb5tvxnK6tIUAuB/ZO8rj2U+qrgC+Oqph2DvSfgO9V1XuH2oevF7wcGP8GxxeBVyVZmORxwN4MLq7Ndp2PTLLj+DSDi6pXt3rGvwl0PPCFoTqPa98mejawduhUzVx4yE928+14Dpnu8bsQeEGSXdrpmRe0tlmV5Ejgz4GXVNW9Q+27JVnQpvdkcPx+0Gq9K8mz27/x44b2bTbrnO77PMrPg+cD11bVv56amm/Hc9pGfRV/Lh4MvuFyPYN0f8eIazmEwWmLq4CV7fEi4GxgVWv/IrBkaJ13tNqvY46+icHgWypXtsc148cNeBTwVeAG4J+BXVt7gL9vda4Cls3hMX0kcAewaKht5MeTQaDdAjzA4Bz27/ccPwbXIG5sjzfMUZ03MrhWMP5v9ANt2X/X/j2sBK4AXjzUzzIGH+DfB95Pu9PFLNc57fd5tj8PJquztZ8FvHHCsiM7njPx8FYmkqQuW8MpLEnSLDBAJEldDBBJUhcDRJLUxQCRJHUxQCRJXQwQSVKX/wcy/pfJ2B4QoQAAAABJRU5ErkJggg==\n"
          },
          "metadata": {
            "needs_background": "light"
          }
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "## full Configuration for a Data Augmentation Experiment\n",
        "# Play with these values and see how they influence accuracy\n",
        "\n",
        "# For Print Logs\n",
        "sep = '_'*50\n",
        "\n",
        "\n",
        "# Base Train Dataset Settings\n",
        "base_dataset=base_df\n",
        "n_per_class_total = 300 \n",
        "train_test_frac=0.9\n",
        "label_col ='y'\n",
        "\n",
        "# Train Settings\n",
        "embeddings_to_use=''\n",
        "epochs=5\n",
        "learn_rate=0.00005\n",
        "batch_size=32\n",
        "droput=0.5\n",
        "\n",
        "\n",
        "# Text Generation Settings\n",
        "prompt_label_prefix='-News-Headline:'\n",
        "slice_len=15\n",
        "num_aug_per_label=25\n",
        "gpt2_pipe = nlu.load('gpt2')\n",
        "gpt2_pipe['gpt2'].setDoSample(True)\n",
        "gpt2_pipe['gpt2'].setMaxOutputLength(30)\n",
        "gpt2_pipe['document_assembler'].setCleanupMode('shrink')\n",
        "\n",
        "learn_rate=0.0005\n",
        "\n",
        "# Run Experiment\n",
        "no_aug_fitted_classifier, aug_fitted_classifier = compare_vanilla_and_augmented_training(\n",
        "  base_dataset=base_dataset,\n",
        "  generator_model=gpt2_pipe,\n",
        "  embeddings_to_use=embeddings_to_use,\n",
        "  n_per_class_total=n_per_class_total,\n",
        "  train_test_frac=train_test_frac,\n",
        "  label_col=label_col,\n",
        "  prompt_label_prefix=prompt_label_prefix,\n",
        "  slice_len=slice_len,\n",
        "  num_aug_per_label=num_aug_per_label,\n",
        "  epochs=epochs,\n",
        "  learn_rate=learn_rate,\n",
        "  batch_size=batch_size,\n",
        "  droput=droput)\n",
        "\n",
        "# Peformance gain for most classes, nice!\n",
        "# Some are much worse, we could use vanilla and the augmented model in our final pipeline with a Voting Mechanism on top or tweak more"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "CPGxfU3Zk0PN",
        "outputId": "418542cb-af69-4485-ed0a-71fdd676f3da",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "gpt2 download started this may take some time.\n",
            "Approximate size to download 442.7 MB\n",
            "[OK!]\n",
            "__________________________________________________ Starting Data Augmentation Experiment __________________________________________________\n",
            "Train Dataset Size = 120\n",
            "Test Dataset Size = 1080\n",
            "Training Vanilla Classifier\n",
            "sent_small_bert_L2_128 download started this may take some time.\n",
            "Approximate size to download 16.1 MB\n",
            "[OK!]\n",
            "Training Settings : Epochs=5, learn_rate=0.0005, batch_size=32, dropout=0.5\n",
            "sentence_detector_dl download started this may take some time.\n",
            "Approximate size to download 354.6 KB\n",
            "[OK!]\n",
            "__________________________________________________ Metrics on Train datataset with Vanilla Model __________________________________________________\n",
            "              precision    recall  f1-score   support\n",
            "\n",
            "    Business       0.60      0.80      0.69       270\n",
            "    Sci/Tech       0.84      0.34      0.48       270\n",
            "      Sports       0.72      0.97      0.83       270\n",
            "       World       0.82      0.76      0.79       270\n",
            "\n",
            "    accuracy                           0.72      1080\n",
            "   macro avg       0.75      0.72      0.70      1080\n",
            "weighted avg       0.75      0.72      0.70      1080\n",
            "\n",
            "__________________________________________________ Metrics on Test Dataset with Vanilla Model __________________________________________________\n",
            "              precision    recall  f1-score   support\n",
            "\n",
            "    Business       0.72      0.78      0.75        50\n",
            "    Sci/Tech       1.00      0.35      0.51        52\n",
            "      Sports       0.57      1.00      0.73        49\n",
            "       World       0.82      0.70      0.75        46\n",
            "\n",
            "    accuracy                           0.70       197\n",
            "   macro avg       0.78      0.71      0.69       197\n",
            "weighted avg       0.78      0.70      0.68       197\n",
            "\n",
            "____________________________________________________________________________________________________\n",
            "__________________________________________________ Generating for label=Sports No. 1/4 __________________________________________________\n",
            "__________________________________________________ Generating new training data example 1/25 for prompt : <Sports-News-Headline: The Indiana Pa> __________________________________________________\n",
            "Generated : \n",
            "  Sports-News-Headline: The Indiana Paxton Report\n",
            "\n",
            "Athletic coach Dave Joerger was fired on Monday, a week after\n",
            "__________________________________________________ Generating new training data example 2/25 for prompt : <Sports-News-Headline:Three weeks awa> __________________________________________________\n",
            "Generated : \n",
            "  Sports-News-Headline:Three weeks awa\n",
            "\n",
            "Vancouver Whitecaps 2 Vancouver Whitecaps 1 Chris Rolfe / ESPN FC\n",
            "\n",
            "Seattle\n",
            "__________________________________________________ Generating new training data example 3/25 for prompt : <Sports-News-Headline:Morocco #39;s H> __________________________________________________\n",
            "Generated : \n",
            "  Sports-News-Headline:Morocco #39;s H&G #39,1H&G;&F1,1-\n",
            "__________________________________________________ Generating new training data example 4/25 for prompt : <Sports-News-Headline:Teenage striker> __________________________________________________\n",
            "Generated : \n",
            "  Sports-News-Headline:Teenage striker claims he has \"never had a stroke of strength\"\n",
            "\n",
            "Worcester forward Thomas Troughton\n",
            "__________________________________________________ Generating new training data example 5/25 for prompt : <Sports-News-Headline:Paul Hamm takes> __________________________________________________\n",
            "Generated : \n",
            "  Sports-News-Headline:Paul Hamm takes on Kevin Durant on TNT\n",
            "\n",
            "3/5/16: Paul Hamm: Kevin Durant's contract deal\n",
            "__________________________________________________ Generating new training data example 6/25 for prompt : <Sports-News-Headline:New Zealand #39> __________________________________________________\n",
            "Generated : \n",
            "  Sports-News-Headline:New Zealand #39\n",
            "\n",
            "Posted February 20, 2012 12:59 AM\n",
            "\n",
            "Hi there,\n",
            "\n",
            "Last week\n",
            "__________________________________________________ Generating new training data example 7/25 for prompt : <Sports-News-Headline:  British polic> __________________________________________________\n",
            "Generated : \n",
            "  Sports-News-Headline: British policewoman found dead after being stabbed in Westminster, Westminster, London, 2 May 2017 The Evening Standard's Mark\n",
            "__________________________________________________ Generating new training data example 8/25 for prompt : <Sports-News-Headline: American 400 m> __________________________________________________\n",
            "Generated : \n",
            "  Sports-News-Headline: American 400 m.mpg.com, 1/24/12: 2.1 m.pbs.com\n",
            "__________________________________________________ Generating new training data example 9/25 for prompt : <Sports-News-Headline: Greek sprinter> __________________________________________________\n",
            "Generated : \n",
            "  Sports-News-Headline: Greek sprinter Thomas Gaikai has already won three world championships.\n",
            "\n",
            "Gaikai will race in the P\n",
            "__________________________________________________ Generating new training data example 10/25 for prompt : <Sports-News-Headline: Goaltender Kev> __________________________________________________\n",
            "Generated : \n",
            "  Sports-News-Headline: Goaltender Kevan Miller signs with San Jose Sharks\n",
            "\n",
            "The second is even harder to swallow: Why the Sharks\n",
            "__________________________________________________ Generating new training data example 11/25 for prompt : <Sports-News-Headline:World record ho> __________________________________________________\n",
            "Generated : \n",
            "  Sports-News-Headline:World record hovers for most games in a given week\n",
            "\n",
            "(Source: NJ.com)\n",
            "\n",
            "NEW YORK\n",
            "__________________________________________________ Generating new training data example 12/25 for prompt : <Sports-News-Headline:The first pick > __________________________________________________\n",
            "Generated : \n",
            "  Sports-News-Headline:The first pick in the draft is expected to make the move this offseason after being a key component of the Rams' turnaround\n",
            "__________________________________________________ Generating new training data example 13/25 for prompt : <Sports-News-Headline: Carly Patterso> __________________________________________________\n",
            "Generated : \n",
            "  Sports-News-Headline: Carly Pattersoo joins ESPN to talk NBA TV's Big Four\n",
            "\n",
            "* The Cavaliers' coach made his first\n",
            "__________________________________________________ Generating new training data example 14/25 for prompt : <Sports-News-Headline:Champions Ajax > __________________________________________________\n",
            "Generated : \n",
            "  Sports-News-Headline:Champions Ajax win in 6th\n",
            "\n",
            "LAST WEEK-SUBJECT: LA U-17\n",
            "\n",
            "*\n",
            "__________________________________________________ Generating new training data example 15/25 for prompt : <Sports-News-Headline:The Football As> __________________________________________________\n",
            "Generated : \n",
            "  Sports-News-Headline:The Football Asphalt (with Brian Garett)\n",
            "\n",
            "By Steve Miller\n",
            "\n",
            "ESPN.com\n",
            "\n",
            "Thursday\n",
            "__________________________________________________ Generating new training data example 16/25 for prompt : <Sports-News-Headline:WASHINGTON, Aug> __________________________________________________\n",
            "Generated : \n",
            "  Sports-News-Headline:WASHINGTON, Aug. 22, 2012\n",
            "\n",
            "\n",
            "NIGALIANA CONTS A SAD DAY with two more days\n",
            "__________________________________________________ Generating new training data example 17/25 for prompt : <Sports-News-Headline:US heavyweight > __________________________________________________\n",
            "Generated : \n",
            "  Sports-News-Headline:US heavyweight contender Floyd Mayweather Jr. defeated Rafael dos Anjos in the main event. Mayweather was booked for the middle\n",
            "__________________________________________________ Generating new training data example 18/25 for prompt : <Sports-News-Headline:Scottish champi> __________________________________________________\n",
            "Generated : \n",
            "  Sports-News-Headline:Scottish champi: F.E.R!'s 's new 'It Was A Great Year' video after\n",
            "__________________________________________________ Generating new training data example 19/25 for prompt : <Sports-News-Headline: Hungarian Olym> __________________________________________________\n",
            "Generated : \n",
            "  Sports-News-Headline: Hungarian Olymour Klimoschuk Named NHL Head Coach - NHL Network (Sportsnet) (Sunday, 9\n",
            "__________________________________________________ Generating new training data example 20/25 for prompt : <Sports-News-Headline:scoring offense> __________________________________________________\n",
            "Generated : \n",
            "  Sports-News-Headline:scoring offense is 'hard to do against any team with a 3.9 or younger offensive line' – Marcio Jose\n",
            "__________________________________________________ Generating new training data example 21/25 for prompt : <Sports-News-Headline: The Montreal E> __________________________________________________\n",
            "Generated : \n",
            "  Sports-News-Headline: The Montreal EAGLES.\n",
            "\n",
            "Seahawks-Chiefs and Falcons-NFL: How the Seahawks are a\n",
            "__________________________________________________ Generating new training data example 22/25 for prompt : <Sports-News-Headline:  Baseball comm> __________________________________________________\n",
            "Generated : \n",
            "  Sports-News-Headline: Baseball commites first-round pick to Cubs with 'high' target to make $14 million\n",
            "\n",
            "CLOSE MLB officials\n",
            "__________________________________________________ Generating new training data example 23/25 for prompt : <Sports-News-Headline: Greek sprinter> __________________________________________________\n",
            "Generated : \n",
            "  Sports-News-Headline: Greek sprinter Gianluca Brambilla seeks return to full\n",
            "\n",
            "Rome: A second consecutive world champion's\n",
            "__________________________________________________ Generating new training data example 24/25 for prompt : <Sports-News-Headline:Heather O #39;R> __________________________________________________\n",
            "Generated : \n",
            "  Sports-News-Headline:Heather O #39;Rams-Cowboys:Heath Lederer with 20,633 Yards in NFL\n",
            "__________________________________________________ Generating new training data example 25/25 for prompt : <Sports-News-Headline:  Cael Sanderso> __________________________________________________\n",
            "Generated : \n",
            "  Sports-News-Headline: Cael Sanderso's Return to 'Rigged' NBA as 'Expectations and Possible Changes'\n",
            "\n",
            "\n",
            "__________________________________________________ Generating for label=Sci/Tech No. 2/4 __________________________________________________\n",
            "__________________________________________________ Generating new training data example 1/25 for prompt : <Sci/Tech-News-Headline:The PlayStation> __________________________________________________\n",
            "Generated : \n",
            "  Sci/Tech-News-Headline:The PlayStation Vita is coming tomorrow. This is a PlayStation 4 release, but not a PlayStation Vita game because the\n",
            "__________________________________________________ Generating new training data example 2/25 for prompt : <Sci/Tech-News-Headline:Simple to code > __________________________________________________\n",
            "Generated : \n",
            "  Sci/Tech-News-Headline:Simple to code and write a basic HTML page using just a few lines of JavaScript.\n",
            "\n",
            "\n",
            "In this tutorial\n",
            "__________________________________________________ Generating new training data example 3/25 for prompt : <Sci/Tech-News-Headline:A team retracin> __________________________________________________\n",
            "Generated : \n",
            "  Sci/Tech-News-Headline:A team retracin and the future of stem cells\n",
            "\n",
            "\"In the near future, there will be\n",
            "__________________________________________________ Generating new training data example 4/25 for prompt : <Sci/Tech-News-Headline:Judges send cas> __________________________________________________\n",
            "Generated : \n",
            "  Sci/Tech-News-Headline:Judges send casters a $1000 fine for using unregistered e-mail addresses\n",
            "\n",
            "\"It took\n",
            "__________________________________________________ Generating new training data example 5/25 for prompt : <Sci/Tech-News-Headline:A giant 100km c> __________________________________________________\n",
            "Generated : \n",
            "  Sci/Tech-News-Headline:A giant 100km creeper at a high point of Australia's Pacific Coast will continue to move west to\n",
            "__________________________________________________ Generating new training data example 6/25 for prompt : <Sci/Tech-News-Headline:By acquiring KV> __________________________________________________\n",
            "Generated : \n",
            "  Sci/Tech-News-Headline:By acquiring KVN, the government will end restrictions on 'high volume adoption.' By getting rid of regulations\n",
            "__________________________________________________ Generating new training data example 7/25 for prompt : <Sci/Tech-News-Headline:  Cisco Systems> __________________________________________________\n",
            "Generated : \n",
            "  Sci/Tech-News-Headline: Cisco Systems CEO in Business Outlook's 30th Year of Sales, and how to get it right\n",
            "\n",
            "How\n",
            "__________________________________________________ Generating new training data example 8/25 for prompt : <Sci/Tech-News-Headline:In its first tw> __________________________________________________\n",
            "Generated : \n",
            "  Sci/Tech-News-Headline:In its first twelfth article, Science (www.science.org) revealed that:We know the Earth\n",
            "__________________________________________________ Generating new training data example 9/25 for prompt : <Sci/Tech-News-Headline: (Aug 22, 2004)> __________________________________________________\n",
            "Generated : \n",
            "  Sci/Tech-News-Headline: (Aug 22, 2004) The Department of Defense is trying to buy \"a significant portion\" of its missile\n",
            "__________________________________________________ Generating new training data example 10/25 for prompt : <Sci/Tech-News-Headline: Two new moons > __________________________________________________\n",
            "Generated : \n",
            "  Sci/Tech-News-Headline: Two new moons orbiting the Moon. http://www.sciencenews.org/article/201506\n",
            "__________________________________________________ Generating new training data example 11/25 for prompt : <Sci/Tech-News-Headline: Defects in Sie> __________________________________________________\n",
            "Generated : \n",
            "  Sci/Tech-News-Headline: Defects in Sieur-Ein-Meinbock (Science) A large outbreak of sal\n",
            "__________________________________________________ Generating new training data example 12/25 for prompt : <Sci/Tech-News-Headline:Description: An> __________________________________________________\n",
            "Generated : \n",
            "  Sci/Tech-News-Headline:Description: An entire planet has disappeared from the solar system. But there were still more mysterious things to this world\n",
            "__________________________________________________ Generating new training data example 13/25 for prompt : <Sci/Tech-News-Headline:Roland Piquepai> __________________________________________________\n",
            "Generated : \n",
            "  Sci/Tech-News-Headline:Roland Piquepai\n",
            "\n",
            "• Apple and Samsung offer a special deal on the first 4K video\n",
            "__________________________________________________ Generating new training data example 14/25 for prompt : <Sci/Tech-News-Headline: Organizations > __________________________________________________\n",
            "Generated : \n",
            "  Sci/Tech-News-Headline: Organizations from across the world are sharing the stories of science, economics and health care\n",
            "\n",
            "Scientists from all over\n",
            "__________________________________________________ Generating new training data example 15/25 for prompt : <Sci/Tech-News-Headline: For pandas, it> __________________________________________________\n",
            "Generated : \n",
            "  Sci/Tech-News-Headline: For pandas, it's difficult to tell one's mate, it seems. I find it hard to understand\n",
            "__________________________________________________ Generating new training data example 16/25 for prompt : <Sci/Tech-News-Headline: Oracle (Nasdaq> __________________________________________________\n",
            "Generated : \n",
            "  Sci/Tech-News-Headline: Oracle (Nasdaq: ORCL), an automotive research firm that has reported major investments in Japanese companies like\n",
            "__________________________________________________ Generating new training data example 17/25 for prompt : <Sci/Tech-News-Headline: Southern Calif> __________________________________________________\n",
            "Generated : \n",
            "  Sci/Tech-News-Headline: Southern Calif. 'Bacon-free\"\n",
            "\n",
            "\"I was always going to be very worried about those\n",
            "__________________________________________________ Generating new training data example 18/25 for prompt : <Sci/Tech-News-Headline:Customers of Sp> __________________________________________________\n",
            "Generated : \n",
            "  Sci/Tech-News-Headline:Customers of Spikes have been told by the US Department of State has received complaints and will be contacting them\n",
            "__________________________________________________ Generating new training data example 19/25 for prompt : <Sci/Tech-News-Headline: A team of scie> __________________________________________________\n",
            "Generated : \n",
            "  Sci/Tech-News-Headline: A team of sciech engineers working to improve the way the digital world sees nanotechnology, with a view\n",
            "__________________________________________________ Generating new training data example 20/25 for prompt : <Sci/Tech-News-Headline: Internosis Inc> __________________________________________________\n",
            "Generated : \n",
            "  Sci/Tech-News-Headline: Internosis Inc., a vaccine designed to prevent brain tumors called MRSA, has been called out as a potential\n",
            "__________________________________________________ Generating new training data example 21/25 for prompt : <Sci/Tech-News-Headline:SAN JOSE, Calif> __________________________________________________\n",
            "Generated : \n",
            "  Sci/Tech-News-Headline:SAN JOSE, Calif. – An emergency medical provider who said he felt he felt like a bomb was hitting\n",
            "__________________________________________________ Generating new training data example 22/25 for prompt : <Sci/Tech-News-Headline:EMC has hired a> __________________________________________________\n",
            "Generated : \n",
            "  Sci/Tech-News-Headline:EMC has hired a full-time member, who worked on that issue of The New York Times as a\n",
            "__________________________________________________ Generating new training data example 23/25 for prompt : <Sci/Tech-News-Headline: TORONTO, Canad> __________________________________________________\n",
            "Generated : \n",
            "  Sci/Tech-News-Headline: TORONTO, Canaday, CANADA, 1,1,2015.\n",
            "\n",
            "SURRY THE\n",
            "__________________________________________________ Generating new training data example 24/25 for prompt : <Sci/Tech-News-Headline: The United Kin> __________________________________________________\n",
            "Generated : \n",
            "  Sci/Tech-News-Headline: The United Kinetics Institute in Chicago.\n",
            "\n",
            "Copyright Science News Service 2014. All rights reserved. This material\n",
            "__________________________________________________ Generating new training data example 25/25 for prompt : <Sci/Tech-News-Headline:Microsoft will > __________________________________________________\n",
            "Generated : \n",
            "  Sci/Tech-News-Headline:Microsoft will be making its next $2 billion in additional acquisitions this year\n",
            "\n",
            "'I'm just thrilled to\n",
            "__________________________________________________ Generating for label=Business No. 3/4 __________________________________________________\n",
            "__________________________________________________ Generating new training data example 1/25 for prompt : <Business-News-Headline:quarter earning> __________________________________________________\n",
            "Generated : \n",
            "  Business-News-Headline:quarter earning $1.4bn in six months, up by $5.4b from a year ago\n",
            "__________________________________________________ Generating new training data example 2/25 for prompt : <Business-News-Headline:UAL's United Ai> __________________________________________________\n",
            "Generated : \n",
            "  Business-News-Headline:UAL's United Ai, Ai Group, Ai, US National News Service Website Online News Links\n",
            "\n",
            "* A few additional\n",
            "__________________________________________________ Generating new training data example 3/25 for prompt : <Business-News-Headline: The newly rele> __________________________________________________\n",
            "Generated : \n",
            "  Business-News-Headline: The newly relevantly Chinese government recently ordered an immediate ban on internet content in the Communist country, the nation's second\n",
            "__________________________________________________ Generating new training data example 4/25 for prompt : <Business-News-Headline:Ford Motor Co. > __________________________________________________\n",
            "Generated : \n",
            "  Business-News-Headline:Ford Motor Co. Chief Financial Officer\n",
            "\n",
            "Chief Financial Officer Mary Jo White with President and CEO Mark Fields at the U\n",
            "__________________________________________________ Generating new training data example 5/25 for prompt : <Business-News-Headline:Description: A > __________________________________________________\n",
            "Generated : \n",
            "  Business-News-Headline:Description: A report by the government research commission's National Bureau of Economic Research has concluded an \"economic downturn is unlikely to\n",
            "__________________________________________________ Generating new training data example 6/25 for prompt : <Business-News-Headline: The dollar idl> __________________________________________________\n",
            "Generated : \n",
            "  Business-News-Headline: The dollar idl is surging by 40% at $4.15 the minute of press at this point https://twitter\n",
            "__________________________________________________ Generating new training data example 7/25 for prompt : <Business-News-Headline:Factory orders > __________________________________________________\n",
            "Generated : \n",
            "  Business-News-Headline:Factory orders 6-story buildings for its 7-story mixed-use building,\n",
            "\n",
            "Tulsa-based Build\n",
            "__________________________________________________ Generating new training data example 8/25 for prompt : <Business-News-Headline:New Delhi, Augu> __________________________________________________\n",
            "Generated : \n",
            "  Business-News-Headline:New Delhi, Auguar-1-2016\n",
            "\n",
            "Pramod Mishra says Dravid and other Dal\n",
            "__________________________________________________ Generating new training data example 9/25 for prompt : <Business-News-Headline:A closely watch> __________________________________________________\n",
            "Generated : \n",
            "  Business-News-Headline:A closely watchable and consistent approach to combat the global financial crisis.\n",
            "\n",
            "WASHINGTON DC - Jan 21: The White\n",
            "__________________________________________________ Generating new training data example 10/25 for prompt : <Business-News-Headline: The CEOs of th> __________________________________________________\n",
            "Generated : \n",
            "  Business-News-Headline: The CEOs of thamco.com, AOL.com and e-mail is for those who can't afford to\n",
            "__________________________________________________ Generating new training data example 11/25 for prompt : <Business-News-Headline:NEW YORKFewer A> __________________________________________________\n",
            "Generated : \n",
            "  Business-News-Headline:NEW YORKFewer Aims of the New Economy\" by Andrew MacDougall, \"Economic Performance,\" August 7,\n",
            "__________________________________________________ Generating new training data example 12/25 for prompt : <Business-News-Headline:Johnson  amp; J> __________________________________________________\n",
            "Generated : \n",
            "  Business-News-Headline:Johnson amp; Japantown, NY: The Bibliotheca Mundi; 531 S. Elm St\n",
            "__________________________________________________ Generating new training data example 13/25 for prompt : <Business-News-Headline:Federal officia> __________________________________________________\n",
            "Generated : \n",
            "  Business-News-Headline:Federal officia can't help but feel overwhelmed\n",
            "\n",
            "U.S. Court of Appeals for the 2nd Circuit (\n",
            "__________________________________________________ Generating new training data example 14/25 for prompt : <Business-News-Headline: U.S. Treasury > __________________________________________________\n",
            "Generated : \n",
            "  Business-News-Headline: U.S. Treasury Department announces new program to help low-income students receive federal aid\n",
            "\n",
            "U.S Treasury\n",
            "__________________________________________________ Generating new training data example 15/25 for prompt : <Business-News-Headline:One of the men > __________________________________________________\n",
            "Generated : \n",
            "  Business-News-Headline:One of the men who stole the story: \"After an initial investigation, this week's investigation showed the man had been\n",
            "__________________________________________________ Generating new training data example 16/25 for prompt : <Business-News-Headline:Beleaguered Rus> __________________________________________________\n",
            "Generated : \n",
            "  Business-News-Headline:Beleaguered Rushes to Rise in Trumpism, but Has Fewer Business-Favorites\n",
            "\n",
            "This writer\n",
            "__________________________________________________ Generating new training data example 17/25 for prompt : <Business-News-Headline: Kroger Co., th> __________________________________________________\n",
            "Generated : \n",
            "  Business-News-Headline: Kroger Co., throws in $5.8 billion in cash for next year alone\n",
            "\n",
            "By the end of\n",
            "__________________________________________________ Generating new training data example 18/25 for prompt : <Business-News-Headline:Goldwyn Meyer h> __________________________________________________\n",
            "Generated : \n",
            "  Business-News-Headline:Goldwyn Meyer hoses GOP primary fight at the weekend\n",
            "\n",
            "The former Republican National Committee chief was accused of soliciting\n",
            "__________________________________________________ Generating new training data example 19/25 for prompt : <Business-News-Headline:After one of th> __________________________________________________\n",
            "Generated : \n",
            "  Business-News-Headline:After one of thorns, he'd done a good job at it. 'He had taken so many, so many\n",
            "__________________________________________________ Generating new training data example 20/25 for prompt : <Business-News-Headline:Australia #39;s> __________________________________________________\n",
            "Generated : \n",
            "  Business-News-Headline:Australia #39;s:1324 – 1329;a:1;c:1\"quality=orl\n",
            "__________________________________________________ Generating new training data example 21/25 for prompt : <Business-News-Headline:Continental Air> __________________________________________________\n",
            "Generated : \n",
            "  Business-News-Headline:Continental Air Lines Will Get the Best Price for International Planes With New Aircraft\n",
            "\n",
            "U.S. airlines will get\n",
            "__________________________________________________ Generating new training data example 22/25 for prompt : <Business-News-Headline:A bank in Belar> __________________________________________________\n",
            "Generated : \n",
            "  Business-News-Headline:A bank in Belarussian territory will not say whether Mr. Voorhis has paid the ransom.\n",
            "\n",
            "\"\n",
            "__________________________________________________ Generating new training data example 23/25 for prompt : <Business-News-Headline:MARK COLVIN: Qa> __________________________________________________\n",
            "Generated : \n",
            "  Business-News-Headline:MARK COLVIN: Qa. How likely are the candidates for the next U.S. Senate seat going\n",
            "__________________________________________________ Generating new training data example 24/25 for prompt : <Business-News-Headline:The US textile > __________________________________________________\n",
            "Generated : \n",
            "  Business-News-Headline:The US textile industry is growing faster than ever, and companies are making huge gains,\" said Matthew A. Schulz,\n",
            "__________________________________________________ Generating new training data example 25/25 for prompt : <Business-News-Headline:Deputy Prime Mi> __________________________________________________\n",
            "Generated : \n",
            "  Business-News-Headline:Deputy Prime Mi. Putin's Russia Is the Stuffed Top to Be Entangled by the U.S. President\n",
            "__________________________________________________ Generating for label=World No. 4/4 __________________________________________________\n",
            "__________________________________________________ Generating new training data example 1/25 for prompt : <World-News-Headline:Panama recalls > __________________________________________________\n",
            "Generated : \n",
            "  World-News-Headline:Panama recalls 'disgraceful' visit to US 'during Cuban missile crisis'\n",
            "\n",
            "As US officials scrambled\n",
            "__________________________________________________ Generating new training data example 2/25 for prompt : <World-News-Headline: A Jewish socia> __________________________________________________\n",
            "Generated : \n",
            "  World-News-Headline: A Jewish socia-\n",
            "\n",
            "mental. We have now seen a full\n",
            "\n",
            "review of this issue, and the\n",
            "__________________________________________________ Generating new training data example 3/25 for prompt : <World-News-Headline:At least seven > __________________________________________________\n",
            "Generated : \n",
            "  World-News-Headline:At least seven deaths in two separate crash in Japan in 2014\n",
            "\n",
            "At least 24 people have died on the road in\n",
            "__________________________________________________ Generating new training data example 4/25 for prompt : <World-News-Headline: The United Sta> __________________________________________________\n",
            "Generated : \n",
            "  World-News-Headline: The United Sta­dio­tion's Expos­ible Theat­ic Report On Syria\n",
            "__________________________________________________ Generating new training data example 5/25 for prompt : <World-News-Headline:The Israeli arm> __________________________________________________\n",
            "Generated : \n",
            "  World-News-Headline:The Israeli armahad group FDD is an extremist organization dedicated to overthrowing Israeli rule. FDD leaders claim responsibility\n",
            "__________________________________________________ Generating new training data example 6/25 for prompt : <World-News-Headline:rebels struggle> __________________________________________________\n",
            "Generated : \n",
            "  World-News-Headline:rebels struggle to win despite victory over Roma\n",
            "\n",
            "\"At the beginning, it was going to end in a stal\n",
            "__________________________________________________ Generating new training data example 7/25 for prompt : <World-News-Headline: Democratic Sen> __________________________________________________\n",
            "Generated : \n",
            "  World-News-Headline: Democratic Sen. Richard Blumenthal's \"Unfaithful\" Response to Obama's 'Anti-LGBT Agenda'\n",
            "\n",
            "Read\n",
            "__________________________________________________ Generating new training data example 8/25 for prompt : <World-News-Headline: Marion Jones m> __________________________________________________\n",
            "Generated : \n",
            "  World-News-Headline: Marion Jones mocks rape accusation made during US v Donald Trump campaign, claims the media is \"fake\"\n",
            "\n",
            "Mr\n",
            "__________________________________________________ Generating new training data example 9/25 for prompt : <World-News-Headline: The United Sta> __________________________________________________\n",
            "Generated : \n",
            "  World-News-Headline: The United Sta.C.R.?\n",
            "\n",
            "This section is dedicated to the Sta...\n",
            "\n",
            "There are many ways\n",
            "__________________________________________________ Generating new training data example 10/25 for prompt : <World-News-Headline:Peace talks bet> __________________________________________________\n",
            "Generated : \n",
            "  World-News-Headline:Peace talks betrothed to Jerusalem, Jordan are dead, says US officials\n",
            "\n",
            "US leaders have told their leaders in\n",
            "__________________________________________________ Generating new training data example 11/25 for prompt : <World-News-Headline: A Sudanese reb> __________________________________________________\n",
            "Generated : \n",
            "  World-News-Headline: A Sudanese rebukes UN\n",
            "\n",
            "On Monday evening, Sudanese President Omar Hassan Al Barghouti called on\n",
            "__________________________________________________ Generating new training data example 12/25 for prompt : <World-News-Headline:THE internation> __________________________________________________\n",
            "Generated : \n",
            "  World-News-Headline:THE internation, a \"very large group of people, mostly from different countries, will be present for several hours,\n",
            "__________________________________________________ Generating new training data example 13/25 for prompt : <World-News-Headline:BRITAIN has war> __________________________________________________\n",
            "Generated : \n",
            "  World-News-Headline:BRITAIN has warred with the EU to fight an international arms race. What are these treaties, the EU's\n",
            "__________________________________________________ Generating new training data example 14/25 for prompt : <World-News-Headline: Democratic Whi> __________________________________________________\n",
            "Generated : \n",
            "  World-News-Headline: Democratic Whiitelashes Bill C-46, Bill Maher\n",
            "\n",
            "Watch: \"A Dangerous Message\" by Laura\n",
            "__________________________________________________ Generating new training data example 15/25 for prompt : <World-News-Headline: The last survi> __________________________________________________\n",
            "Generated : \n",
            "  World-News-Headline: The last surviress to be found\n",
            "\n",
            "The missing former Marine was found in a wooded area near a farm\n",
            "__________________________________________________ Generating new training data example 16/25 for prompt : <World-News-Headline: Rebel Shi'ite > __________________________________________________\n",
            "Generated : \n",
            "  World-News-Headline: Rebel Shi'ite-Germans in Ukraine's Deceased, and Their \"Consequences\"? (C\n",
            "__________________________________________________ Generating new training data example 17/25 for prompt : <World-News-Headline:LONDON, AUGUST > __________________________________________________\n",
            "Generated : \n",
            "  World-News-Headline:LONDON, AUGUST 19 (Reuters) - Thousands of British voters turned out to a London party room to watch\n",
            "__________________________________________________ Generating new training data example 18/25 for prompt : <World-News-Headline:A leaked Israel> __________________________________________________\n",
            "Generated : \n",
            "  World-News-Headline:A leaked Israel-News report says Israel wants to behead U.S. journalist Steven Sotloff\n",
            "\n",
            "\n",
            "__________________________________________________ Generating new training data example 19/25 for prompt : <World-News-Headline:Tensions betwee> __________________________________________________\n",
            "Generated : \n",
            "  World-News-Headline:Tensions betweered in Yemen are set to increase over coming weeks, with Saudi Arabia to close its border\n",
            "\n",
            "__________________________________________________ Generating new training data example 20/25 for prompt : <World-News-Headline: Fans of Michae> __________________________________________________\n",
            "Generated : \n",
            "  World-News-Headline: Fans of Michaeldis had a taste of the new summer of soccer Tuesday afternoon when he announced his retirement, promising\n",
            "__________________________________________________ Generating new training data example 21/25 for prompt : <World-News-Headline: CARACAS, Venez> __________________________________________________\n",
            "Generated : \n",
            "  World-News-Headline: CARACAS, Venezia, D. \"Mexico to Reject SINGLE REFUGE for Venez\n",
            "__________________________________________________ Generating new training data example 22/25 for prompt : <World-News-Headline:TBILISI, Georgi> __________________________________________________\n",
            "Generated : \n",
            "  World-News-Headline:TBILISI, Georgi Ivanov, \"Russia's President Says the United States Should Not Pay North Korea for\n",
            "__________________________________________________ Generating new training data example 23/25 for prompt : <World-News-Headline:Israeli militar> __________________________________________________\n",
            "Generated : \n",
            "  World-News-Headline:Israeli militarizes 'Jewish state' on video\n",
            "\n",
            "The New York Times, the main news website in the United States\n",
            "__________________________________________________ Generating new training data example 24/25 for prompt : <World-News-Headline:AGHDAD, Iraq, A> __________________________________________________\n",
            "Generated : \n",
            "  World-News-Headline:AGHDAD, Iraq, AUG 17th, 2017.\n",
            "\n",
            "Iraq's Oil Production Exports\n",
            "\n",
            "Total\n",
            "__________________________________________________ Generating new training data example 25/25 for prompt : <World-News-Headline:Tuesday: A Shii> __________________________________________________\n",
            "Generated : \n",
            "  World-News-Headline:Tuesday: A Shii-Gai War Is At Work, Wives Have Disclosed A Massive Baby To Watch\n",
            "Augmented Training Dataset Size = 220\n",
            "Training Augmented Classifierr\n",
            "sent_small_bert_L2_128 download started this may take some time.\n",
            "Approximate size to download 16.1 MB\n",
            "[OK!]\n",
            "Training Settings : Epochs=5, learn_rate=0.0005, batch_size=32, dropout=0.5\n",
            "sentence_detector_dl download started this may take some time.\n",
            "Approximate size to download 354.6 KB\n",
            "[OK!]\n",
            "__________________________________________________ Metrics on vanilla Train datataset with AUGMENTED Model __________________________________________________\n",
            "              precision    recall  f1-score   support\n",
            "\n",
            "    Business       1.00      0.14      0.25        50\n",
            "    Sci/Tech       0.62      0.92      0.74        52\n",
            "      Sports       0.83      0.88      0.85        49\n",
            "       World       0.69      0.91      0.79        46\n",
            "\n",
            "    accuracy                           0.71       197\n",
            "   macro avg       0.78      0.71      0.66       197\n",
            "weighted avg       0.78      0.71      0.65       197\n",
            "\n",
            "____________________________________________________________________________________________________\n",
            "__________________________________________________ Metrics on vanilla Train datataset with VANILLA Model __________________________________________________\n",
            "              precision    recall  f1-score   support\n",
            "\n",
            "    Business       0.60      0.80      0.69       270\n",
            "    Sci/Tech       0.84      0.34      0.48       270\n",
            "      Sports       0.72      0.97      0.83       270\n",
            "       World       0.82      0.76      0.79       270\n",
            "\n",
            "    accuracy                           0.72      1080\n",
            "   macro avg       0.75      0.72      0.70      1080\n",
            "weighted avg       0.75      0.72      0.70      1080\n",
            "\n",
            "____________________________________________________________________________________________________\n",
            "__________________________________________________ Metrics on Test datataset with AUGMENTED Model __________________________________________________\n",
            "              precision    recall  f1-score   support\n",
            "\n",
            "    Business       0.86      0.07      0.14        81\n",
            "    Sci/Tech       0.61      0.86      0.71        90\n",
            "      Sports       0.71      0.88      0.79        78\n",
            "       World       0.70      0.87      0.77        77\n",
            "\n",
            "    accuracy                           0.67       326\n",
            "   macro avg       0.72      0.67      0.60       326\n",
            "weighted avg       0.72      0.67      0.60       326\n",
            "\n",
            "____________________________________________________________________________________________________\n",
            "__________________________________________________ Metrics on Test datataset with VANILLA Model __________________________________________________\n",
            "              precision    recall  f1-score   support\n",
            "\n",
            "    Business       0.72      0.78      0.75        50\n",
            "    Sci/Tech       1.00      0.35      0.51        52\n",
            "      Sports       0.57      1.00      0.73        49\n",
            "       World       0.82      0.70      0.75        46\n",
            "\n",
            "    accuracy                           0.70       197\n",
            "   macro avg       0.78      0.71      0.69       197\n",
            "weighted avg       0.78      0.70      0.68       197\n",
            "\n",
            "____________________________________________________________________________________________________\n",
            "__________________________________________________ Metrics on Augmented Train datataset with AUGMENTED Model __________________________________________________\n",
            "              precision    recall  f1-score   support\n",
            "\n",
            "    Business       0.91      0.11      0.20       270\n",
            "    Sci/Tech       0.58      0.80      0.67       270\n",
            "      Sports       0.79      0.92      0.85       270\n",
            "       World       0.64      0.86      0.73       270\n",
            "\n",
            "    accuracy                           0.67      1080\n",
            "   macro avg       0.73      0.67      0.61      1080\n",
            "weighted avg       0.73      0.67      0.61      1080\n",
            "\n",
            "______________________________________________________________________________________________________________________________________________________\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Data Augmentation on a Finance Dataset\n",
        "\n",
        "![img](https://www.nasdaq.com/sites/acquia.prod/files/image/29525db076bcc42505a356e55dbe94f38b28530b_getty-stock-market-data.jpg?1540063537)\n",
        "\n",
        "This section uses a Twitter stock-market sentiment dataset from Kaggle: https://www.kaggle.com/datasets/yash612/stockmarket-sentiment-dataset"
      ],
      "metadata": {
        "id": "e-g85HX-1r2_",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import pandas as pd\n",
        "\n",
        "# Finance: Twitter sentiment data, originally from\n",
        "# https://www.kaggle.com/datasets/yash612/stockmarket-sentiment-dataset\n",
        "! wget http://ckl-it.de/wp-content/uploads/2021/02/stock_data.csv\n",
        "\n",
        "# Show long tweets in full instead of truncating them\n",
        "pd.set_option('display.max_colwidth', 800)\n",
        "\n",
        "# wget saves into the current working directory (/content on Colab)\n",
        "train_df = pd.read_csv('stock_data.csv')\n",
        "\n",
        "# Keep only the tweet text and its sentiment label\n",
        "columns = ['text', 'y']\n",
        "train_df = train_df[columns]\n",
        "train_df.y.value_counts().plot.barh(title='Label Distribution')\n",
        "train_df.head(10)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 830
        },
        "id": "kg1jFLslbqB1",
        "outputId": "c9a0af40-a3ee-43ea-e3c4-f158c396b6a9",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "--2022-09-04 11:41:27--  http://ckl-it.de/wp-content/uploads/2021/02/stock_data.csv\n",
            "Resolving ckl-it.de (ckl-it.de)... 217.160.0.108, 2001:8d8:100f:f000::209\n",
            "Connecting to ckl-it.de (ckl-it.de)|217.160.0.108|:80... connected.\n",
            "HTTP request sent, awaiting response... 200 OK\n",
            "Length: 758217 (740K) [text/csv]\n",
            "Saving to: ‘stock_data.csv’\n",
            "\n",
            "stock_data.csv      100%[===================>] 740.45K  --.-KB/s    in 0.1s    \n",
            "\n",
            "2022-09-04 11:41:27 (7.14 MB/s) - ‘stock_data.csv’ saved [758217/758217]\n",
            "\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "                                                                                                                                        text  \\\n",
              "0                                    AAP 950 lot bid in the Feb 580 call at 9.50, might push the stock up or create support. 26 delta option   \n",
              "1                                                   user: ADBE continues to weaken, not following market. Short on break under 37.60. user     \n",
              "2                                                                 pnra missed this one yest on the short side but will short under 50 sma      \n",
              "3   PPO: 2) GM estimates 20% Y/Y unit growth for volt (36K). great headline, but PPO built capacity for 60K+. this is bad for margins. (2/4)   \n",
              "4                                                                                                             BAC avrg positive2.positive27    \n",
              "5  CMG While traveling through Kalamazoo, MI on Friday at 6:00, the place was empty. This is the growth hopes for the company. eshorted  325   \n",
              "6                                 i dont have a tech friend in world that owns the stock ...thats why user: PCN No one Believes anymore :      \n",
              "7                                                                                                                            SWI was a beaut   \n",
              "8       Again, this mkt is FAT while AAP is down 2.5% cratering thru huge psychological level, tells me this tape is strong SPY #dontbeshort   \n",
              "9                                                                                    APO good volume bounce; entry 22.positive9 stop 20.50     \n",
              "\n",
              "          y  \n",
              "0  positive  \n",
              "1  negative  \n",
              "2  negative  \n",
              "3  negative  \n",
              "4  positive  \n",
              "5  negative  \n",
              "6  positive  \n",
              "7  positive  \n",
              "8  positive  \n",
              "9  positive  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-5d590d81-3c7a-4370-8fc4-6c4a2f4e1437\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>text</th>\n",
              "      <th>y</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>AAP 950 lot bid in the Feb 580 call at 9.50, might push the stock up or create support. 26 delta option</td>\n",
              "      <td>positive</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>user: ADBE continues to weaken, not following market. Short on break under 37.60. user</td>\n",
              "      <td>negative</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>pnra missed this one yest on the short side but will short under 50 sma</td>\n",
              "      <td>negative</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>PPO: 2) GM estimates 20% Y/Y unit growth for volt (36K). great headline, but PPO built capacity for 60K+. this is bad for margins. (2/4)</td>\n",
              "      <td>negative</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>BAC avrg positive2.positive27</td>\n",
              "      <td>positive</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>5</th>\n",
              "      <td>CMG While traveling through Kalamazoo, MI on Friday at 6:00, the place was empty. This is the growth hopes for the company. eshorted  325</td>\n",
              "      <td>negative</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>6</th>\n",
              "      <td>i dont have a tech friend in world that owns the stock ...thats why user: PCN No one Believes anymore :</td>\n",
              "      <td>positive</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>7</th>\n",
              "      <td>SWI was a beaut</td>\n",
              "      <td>positive</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>8</th>\n",
              "      <td>Again, this mkt is FAT while AAP is down 2.5% cratering thru huge psychological level, tells me this tape is strong SPY #dontbeshort</td>\n",
              "      <td>positive</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>9</th>\n",
              "      <td>APO good volume bounce; entry 22.positive9 stop 20.50</td>\n",
              "      <td>positive</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-5d590d81-3c7a-4370-8fc4-6c4a2f4e1437')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "        \n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "      \n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-5d590d81-3c7a-4370-8fc4-6c4a2f4e1437 button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-5d590d81-3c7a-4370-8fc4-6c4a2f4e1437');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 51
        },
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "<Figure size 432x288 with 1 Axes>"
            ],
            "image/png": "iVBORw0KGgoAAAANSUhEUgAAAY8AAAEICAYAAACnL3iHAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAS70lEQVR4nO3de7BlZX3m8e8jjSgXgQY1bSscIYSIMSJ0DDpqTEghioYxIhAxQEzKcaJxHGMcHCczJGqqDYllLJJBMyEQREEwlpYSg3FCYkgQurG5Cc1Fm8IGubRyMWAG8Dd/7PeYxfH05T23ffrw/VTtOmu/6/Zb7957Pedda/fpVBWSJPV4wrgLkCRtfwwPSVI3w0OS1M3wkCR1MzwkSd0MD0lSN8NDjxtJLknyGwu9blv/pUnWz3T9abb3N0lOatMnJ/mnOdz2CUkunqvtaWkyPLTdSbIhyS+Ou45JSU5N8nCSB9rjxiSnJ1kxuUxVfaWqDtzGbX18a8tV1Sur6uw5qH0iSSVZNtj2uVV1xGy3raXN8JDmxvlVtRuwHHgt8GPA2mGAzIWM+LnV2Pkm1JKRZM8kn09yd5LvtulnTlls/ySXJ7k/yWeTLB+sf1iSf05yb5Krkry8t4aqeriqrgOOA+4Gfrtt++VJvjXY139LsrGNVNYnOTzJkcB/B45L8r0kV7VlL0nygSSXAg8C+01zGS1ttHNfkhuSHD6Y8ZiR2pTRzT+2n/e2fb5o6mWwJC9OckXb9hVJXjyYd0mS9yW5tB3LxUn27u03bX8MDy0lTwD+EtgX2Ad4CDh9yjInAm8CVgCPAB8BSLIS+ALwfkajh3cBn07y1JkUUlWPAp8FXjp1XpIDgbcBP9NGK68ANlTVF4E/YDSK2bWqnj9Y7VeBNwO7AbdOs8ufBW4B9gb+F/DXw2Dcgpe1n3u0ff7LlFqXM+qXjwB7AR8CvpBkr8FibwB+DXga8ERGfaclzvDQklFVm6rq01X1YFU9AHwA+Lkpi51TVddW1b8Cvwscm2QH4I3ARVV1UVX9oKq+BKwBXjWLkm5nFERTPQrsBByUZMeq2lBVt2xlW2dV1XVV9UhVPTzN/LuAD7eRz/nAeuCoWdQ+6Sjgpqo6p+37k8ANwGsGy/xlVd1YVQ8BnwIOnoP9apEzPLRkJNk5yUeT3JrkfkaXZPZo4TDptsH0rcCOjH5b3xd4fbtkdW+Se4GXMBqhzNRK4DtTG6vqZuAdwKnAXUnOS/KMrWzrtq3M31iP/SuntwJb2+a2eAY/OtK5ldGxTfr2YPpBYNc52K8WOcNDS8lvAwcCP1tVT+HfL8lksMyzBtP7AA8D9zA6OZ9TVXsMHrtU1eqZFNJuar8G+Mp086vqE1X1EkahVcAHJ2dtZpNb+/PXK5MMj3MfRiMfgH8Fdh7M+7GO7d7eahzaB9i4lfW0xBke2l7tmORJg8cyRvcDHmJ083c5o2v/U70xyUFJdgZ+H7iw3Z/4OPCaJK9IskPb5sunueG+RUmWJXkO8ElGJ+kPTbPMgUl+IclOwPdbzT9os+8EJmbwjaqnAW9PsmOS1wPPAS5q89YBx7d5q4BjBuvd3fa932a2exHwE0ne0I7tOOAg4POd9WmJMTy0vbqI0Ul38nEq8GHgyYxGEpcBX5xmvXOAsxhdankS8HaAqroNOJrRt53uZjQS+R22/TNyXJLvAfcBnwM2AYdW1e3TLLsTsLrV+W1GJ/73tHkXtJ+bkly5jfsG+CpwQNvmB4BjqmpTm/e7wP7Ad4HfAz4xuVJVPdiWv7RdrjtsuNG2jVczGtVtAt4NvLqq7umoTUtQ/M+gJEm9HHlIkroZHpKkboaHJKmb4SFJ6rZs64ssDXvvvXdNTEyMuwxJ2q6sXbv2nqr6kT/T87gJj4mJCdasWTPuMiRpu5Jkur+l5mUrSVI/w0OS1M3wkCR1MzwkSd0MD0lSN8NDkt
TN8JAkdTM8JEndDA9JUjfDQ5LUzfCQJHUzPCRJ3QwPSVI3w0OS1M3wkCR1MzwkSd0MD0lSN8NDktTN8JAkdTM8JEndDA9JUjfDQ5LUzfCQJHUzPCRJ3QwPSVI3w0OS1G3ZuAtYKNdsvI+JU74w7jIkaUFtWH3UvGzXkYckqZvhIUnqZnhIkroZHpKkboaHJKmb4SFJ6mZ4SJK6GR6SpG6GhySpm+EhSepmeEiSuhkekqRuhockqZvhIUnqZnhIkroZHpKkboaHJKmb4SFJ6mZ4SJK6GR6SpG6GhySpm+EhSepmeEiSuo09PJLskeQ3B8+fkeTCcdYkSdqysYcHsAfww/Coqtur6pgx1iNJ2oqthkeSiSTXJ/nzJNcluTjJk5Psn+SLSdYm+UqSn2zL75/ksiTXJHl/ku+19l2TfDnJlW3e0W0Xq4H9k6xLclrb37VtncuSPHdQyyVJViXZJcmZSS5P8rXBtiRJC2BbRx4HAH9aVc8F7gVeB3wM+K2qOhR4F/Bnbdk/Af6kqp4HfGuwje8Dr62qQ4CfB/44SYBTgFuq6uCq+p0p+z0fOBYgyQpgRVWtAd4L/N+qemHb1mlJdpladJI3J1mTZM2jD963jYcqSdqabQ2Pb1bVuja9FpgAXgxckGQd8FFgRZv/IuCCNv2JwTYC/EGSq4G/A1YCT9/Kfj8FTF7COhaYvBdyBHBK2/clwJOAfaauXFUfq6pVVbVqh51334bDlCRti2XbuNy/DaYfZXTSv7eqDu7Y1wnAU4FDq+rhJBsYnfQ3q6o2JtmU5KeB44C3tFkBXldV6zv2L0maIzO9YX4/8M0krwfIyPPbvMsYXdYCOH6wzu7AXS04fh7Yt7U/AOy2hX2dD7wb2L2qrm5tfwv8VrvsRZIXzPA4JEkzMJtvW50A/HqSq4DrgMmb1u8A3tkuT/04MHmz4VxgVZJrgBOBGwCqahNwaZJrk5w2zX4uZBRCnxq0vQ/YEbg6yXXtuSRpgWz1slVVbQB+avD8jwazj5xmlY3AYVVVSY4HDmzr3cPofsh0+3jDlKbh/u6cWmdVPQT8p63VLkmaH9t6z6PHocDp7ZLSvcCb5mEfkqQxmvPwqKqvAM/f6oKSpO3WYvgX5pKk7YzhIUnqZnhIkroZHpKkboaHJKmb4SFJ6mZ4SJK6GR6SpG6GhySpm+EhSepmeEiSuhkekqRuhockqZvhIUnqZnhIkroZHpKkboaHJKmb4SFJ6mZ4SJK6zfn/Yb5YPW/l7qxZfdS4y5CkJcGRhySpm+EhSepmeEiSuhkekqRuhockqZvhIUnqZnhIkroZHpKkboaHJKmb4SFJ6mZ4SJK6GR6SpG6GhySpm+EhSepmeEiSuhkekqRuhockqZvhIUnqZnhIkroZHpKkboaHJKmb4SFJ6mZ4SJK6GR6SpG6GhySpm+EhSepmeEiSuhkekqRuhockqZvhIUnqZnhIkroZHpKkboaHJKmb4SFJ6mZ4SJK6GR6SpG6GhySpm+EhSepmeEiSuhkekqRuhockqZvhIUnqZnhIkroZHpKkboaHJKmb4SFJ6mZ4SJK6GR6SpG6GhySpm+EhSepmeEiSuhkekqRuhockqduycRewUK7ZeB8Tp3xh3GVI0oLasPqoedmuIw9JUjfDQ5LUzfCQJHUzPCRJ3QwPSVI3w0OS1M3wkCR1MzwkSd0MD0lSN8NDktTN8JAkdTM8JEndDA9JUjfDQ5LUzfCQJHUzPCRJ3QwPSVI3w0OS1M3wkCR1MzwkSd0MD0lSN8NDktTN8JAkdRtbeCR5S5IT2/TJSZ4xmPd/khw0rtokSVu2bFw7rqozBk9PBq4Fbm/zfmMcNUmSts2MRh5JJpLckOTcJNcnuTDJzkkOT/K1JNckOTPJTm351Um+nuTqJH/U2k5N8q4kxwCrgHOTrEvy5CSXJFnVRienDfZ7cpLT2/Qbk1ze1vlokh1m3x2SpG0xm8tWBwJ/VlXPAe4H3gmcBRxXVc9jNK
r5z0n2Al4LPLeqfhp4/3AjVXUhsAY4oaoOrqqHBrM/3daddBxwXpLntOn/UFUHA48CJ0wtMMmbk6xJsubRB++bxaFKkoZmEx63VdWlbfrjwOHAN6vqxtZ2NvAy4D7g+8BfJPll4MFt3UFV3Q18I8lhLYR+Eri07etQ4Iok69rz/aZZ/2NVtaqqVu2w8+4zOkhJ0o+azT2PmvL8XmCvH1mo6pEkL2R0gj8GeBvwCx37OQ84FrgB+ExVVZIAZ1fVe2ZUuSRpVmYz8tgnyYva9BsYXXqaSPLjre1XgX9Isiuwe1VdBPxX4PnTbOsBYLfN7OczwNHArzAKEoAvA8ckeRpAkuVJ9p3FsUiSOsxm5LEeeGuSM4GvA28HLgMuSLIMuAI4A1gOfDbJk4Awujcy1VnAGUkeAl40nFFV301yPXBQVV3e2r6e5H8AFyd5AvAw8Fbg1lkcjyRpG6Vq6tWnbVgpmQA+X1U/NdcFzZedVhxQK0768LjLkKQFtWH1UbNaP8naqlo1td1/YS5J6jajy1ZVtQHYbkYdkqS55chDktTN8JAkdTM8JEndDA9JUjfDQ5LUzfCQJHUzPCRJ3QwPSVI3w0OS1M3wkCR1MzwkSd0MD0lSN8NDktTN8JAkdTM8JEndDA9JUjfDQ5LUzfCQJHUzPCRJ3QwPSVK3ZeMuYKE8b+XurFl91LjLkKQlwZGHJKmb4SFJ6mZ4SJK6GR6SpG6GhySpm+EhSepmeEiSuhkekqRuhockqZvhIUnqZnhIkroZHpKkboaHJKmb4SFJ6mZ4SJK6GR6SpG6GhySpm+EhSepmeEiSuhkekqRuhockqZvhIUnqZnhIkroZHpKkboaHJKmb4SFJ6paqGncNCyLJA8D6cdexBXsD94y7iC2wvplbzLWB9c3WUq9v36p66tTGZbPY4PZmfVWtGncRm5NkjfXN3GKubzHXBtY3W4/X+rxsJUnqZnhIkro9nsLjY+MuYCusb3YWc32LuTawvtl6XNb3uLlhLkmaO4+nkYckaY4YHpKkbks+PJIcmWR9kpuTnDKmGp6V5O+TfD3JdUn+S2s/NcnGJOva41WDdd7Tal6f5BULUOOGJNe0Ota0tuVJvpTkpvZzz9aeJB9p9V2d5JB5ru3AQR+tS3J/kneMs/+SnJnkriTXDtq6+yvJSW35m5KcNM/1nZbkhlbDZ5Ls0donkjw06MczBusc2t4XN7djyDzW1/16zsfnezO1nT+oa0OSda19HH23ufPJwr7/qmrJPoAdgFuA/YAnAlcBB42hjhXAIW16N+BG4CDgVOBd0yx/UKt1J+DZ7Rh2mOcaNwB7T2n7Q+CUNn0K8ME2/Srgb4AAhwFfXeDX9NvAvuPsP+BlwCHAtTPtL2A58I32c882vec81ncEsKxNf3BQ38RwuSnbubzVnHYMr5zH+rpez/n6fE9X25T5fwz8zzH23ebOJwv6/lvqI48XAjdX1Teq6v8B5wFHL3QRVXVHVV3Zph8ArgdWbmGVo4HzqurfquqbwM2MjmWhHQ2c3abPBv7joP2vauQyYI8kKxaopsOBW6rq1i0sM+/9V1X/CHxnmv329NcrgC9V1Xeq6rvAl4Aj56u+qrq4qh5pTy8DnrmlbbQan1JVl9XobPNXg2Oa8/q2YHOv57x8vrdUWxs9HAt8ckvbmOe+29z5ZEHff0s9PFYCtw2ef4stn7TnXZIJ4AXAV1vT29pQ8szJYSbjqbuAi5OsTfLm1vb0qrqjTX8bePoY65t0PI/94C6W/oP+/hpnP76J0W+jk56d5GtJ/iHJS1vbylbTQtbX83qOo/9eCtxZVTcN2sbWd1POJwv6/lvq4bGoJNkV+DTwjqq6H/jfwP7AwcAdjIbD4/KSqjoEeCXw1iQvG85svz2N9XvdSZ4I/BJwQWtaTP33GIuhvzYnyXuBR4BzW9MdwD5V9QLgncAnkjxlDKUt2tdz4Fd47C8vY+u7ac4nP7QQ77+lHh4bgWcNnj+ztS24JD
syeqHPraq/BqiqO6vq0ar6AfDn/PullQWvu6o2tp93AZ9ptdw5eTmq/bxrXPU1rwSurKo7W62Lpv+a3v5a8DqTnAy8GjihnWBol4M2tem1jO4j/ESrZXhpa17rm8HruaD9l2QZ8MvA+YOax9J3051PWOD331IPjyuAA5I8u/3WejzwuYUuol0n/Qvg+qr60KB9eJ/gtcDktzs+BxyfZKckzwYOYHTzbb7q2yXJbpPTjG6sXtvqmPwGxknAZwf1ndi+xXEYcN9guDyfHvNb32Lpv4He/vpb4Igke7ZLNEe0tnmR5Ejg3cAvVdWDg/anJtmhTe/HqL++0Wq8P8lh7T184uCY5qO+3tdzoT/fvwjcUFU/vBw1jr7b3PmEhX7/zcXd/8X8YPRNgxsZ/Ubw3jHV8BJGQ8irgXXt8SrgHOCa1v45YMVgnfe2mtczR9/S2EJ9+zH6pspVwHWT/QTsBXwZuAn4O2B5aw/wp62+a4BVC9CHuwCbgN0HbWPrP0YhdgfwMKNrxb8+k/5idO/h5vb4tXmu72ZG17gn34NntGVf1173dcCVwGsG21nF6CR+C3A67a9SzFN93a/nfHy+p6uttZ8FvGXKsuPou82dTxb0/eefJ5EkdVvql60kSfPA8JAkdTM8JEndDA9JUjfDQ5LUzfCQJHUzPCRJ3f4/EFNOJYGecP8AAAAASUVORK5CYII=\n"
          },
          "metadata": {
            "needs_background": "light"
          }
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# Settings tweaked for the finance dataset\n",
        "# Prompt = label + prefix + first slice_len characters of a tweet,\n",
        "# e.g. 'negative-finance-tweet:AAP Ouch !!!!'\n",
        "prompt_label_prefix='-finance-tweet:'\n",
        "slice_len=25\n",
        "# Number of new examples to generate per label\n",
        "num_aug_per_label=30\n",
        "\n",
        "no_aug_fitted_classifier, aug_fitted_classifier = compare_vanilla_and_augmented_training(\n",
        "  base_dataset=train_df,\n",
        "  generator_model=gpt2_pipe,\n",
        "  embeddings_to_use=embeddings_to_use,\n",
        "  n_per_class_total=n_per_class_total,\n",
        "  train_test_frac=train_test_frac,\n",
        "  label_col=label_col,\n",
        "  prompt_label_prefix=prompt_label_prefix,\n",
        "  slice_len=slice_len,\n",
        "  num_aug_per_label=num_aug_per_label,\n",
        "  epochs=epochs,\n",
        "  learn_rate=learn_rate,\n",
        "  batch_size=batch_size,\n",
        "  droput=droput)\n",
        "\n",
        "# In this run, augmentation improved one class by about 1%; further tuning of the settings above may yield more."
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "jxgIQCSV1nCz",
        "outputId": "db0c39f7-208c-4b93-fc28-b1c78d38a95f",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "__________________________________________________ Starting Data Augmentation Experiment __________________________________________________\n",
            "Train Dataset Size = 60\n",
            "Test Dataset Size = 540\n",
            "Training Vanilla Classifier\n",
            "sent_small_bert_L2_128 download started this may take some time.\n",
            "Approximate size to download 16.1 MB\n",
            "[OK!]\n",
            "Training Settings : Epochs=5, learn_rate=0.0005, batch_size=32, dropout=0.5\n",
            "sentence_detector_dl download started this may take some time.\n",
            "Approximate size to download 354.6 KB\n",
            "[OK!]\n",
            "__________________________________________________ Metrics on Train datataset with Vanilla Model __________________________________________________\n",
            "              precision    recall  f1-score   support\n",
            "\n",
            "    negative       0.55      0.59      0.57       270\n",
            "    positive       0.55      0.51      0.53       270\n",
            "\n",
            "    accuracy                           0.55       540\n",
            "   macro avg       0.55      0.55      0.55       540\n",
            "weighted avg       0.55      0.55      0.55       540\n",
            "\n",
            "__________________________________________________ Metrics on Test Dataset with Vanilla Model __________________________________________________\n",
            "              precision    recall  f1-score   support\n",
            "\n",
            "    negative       0.83      0.79      0.81        48\n",
            "    positive       0.82      0.85      0.84        54\n",
            "\n",
            "    accuracy                           0.82       102\n",
            "   macro avg       0.82      0.82      0.82       102\n",
            "weighted avg       0.82      0.82      0.82       102\n",
            "\n",
            "____________________________________________________________________________________________________\n",
            "__________________________________________________ Generating for label=negative No. 1/2 __________________________________________________\n",
            "__________________________________________________ Generating new training data example 1/30 for prompt : <negative-finance-tweet:AAP Ouch !!!!> __________________________________________________\n",
            "Generated : \n",
            "  negative-finance-tweet:AAP Ouch !!!! We'll be back after 5pm tomorrow to celebrate our 15th birthday!!!\n",
            "\n",
            "\n",
            "__________________________________________________ Generating new training data example 2/30 for prompt : <negative-finance-tweet:AAP 422 nxt tgt> __________________________________________________\n",
            "Generated : \n",
            "  negative-finance-tweet:AAP 422 nxt tgt gt pwd nt vw dx hx y3 tn\n",
            "__________________________________________________ Generating new training data example 3/30 for prompt : <negative-finance-tweet:CEE mother faacker.> __________________________________________________\n",
            "Generated : \n",
            "  negative-finance-tweet:CEE mother faacker. But what about kids, that can't be taken seriously:\n",
            "\n",
            "I'm a\n",
            "__________________________________________________ Generating new training data example 4/30 for prompt : <negative-finance-tweet:SHOT Setups: AAP GOOG BBY> __________________________________________________\n",
            "Generated : \n",
            "  negative-finance-tweet:SHOT Setups: AAP GOOG BBY JUAN GOO GOO SRS PORTRA\n",
            "__________________________________________________ Generating new training data example 5/30 for prompt : <negative-finance-tweet:looks like target is 44.0> __________________________________________________\n",
            "Generated : \n",
            "  negative-finance-tweet:looks like target is 44.0% in the last 12 months: https://t.co/vh\n",
            "__________________________________________________ Generating new training data example 6/30 for prompt : <negative-finance-tweet:Green Weekly Triangle on > __________________________________________________\n",
            "Generated : \n",
            "  negative-finance-tweet:Green Weekly Triangle on November 19\n",
            "__________________________________________________ Generating new training data example 7/30 for prompt : <negative-finance-tweet:TE - new short on board f> __________________________________________________\n",
            "Generated : \n",
            "  negative-finance-tweet:TE - new short on board fyi: https://t.co/cKfSjhE2\n",
            "__________________________________________________ Generating new training data example 8/30 for prompt : <negative-finance-tweet:user Me too but NFX has b> __________________________________________________\n",
            "Generated : \n",
            "  negative-finance-tweet:user Me too but NFX has bumbled with its finances. In recent months it had just over $5 lakh\n",
            "__________________________________________________ Generating new training data example 9/30 for prompt : <negative-finance-tweet:Green Weekly Triangle on > __________________________________________________\n",
            "Generated : \n",
            "  negative-finance-tweet:Green Weekly Triangle on Scribd\n",
            "\n",
            "And after the elections, it would probably have been even worse—since most\n",
            "__________________________________________________ Generating new training data example 10/30 for prompt : <negative-finance-tweet:JPM bounced off it's fib > __________________________________________________\n",
            "Generated : \n",
            "  negative-finance-tweet:JPM bounced off it's fibs [14:03:10] <JW_Ew_B\n",
            "__________________________________________________ Generating new training data example 11/30 for prompt : <negative-finance-tweet:FIO this action in this s> __________________________________________________\n",
            "Generated : \n",
            "  negative-finance-tweet:FIO this action in this sordid and disgusting way is only the tip of the iceberg. https://t\n",
            "__________________________________________________ Generating new training data example 12/30 for prompt : <negative-finance-tweet:BAC when she loses the 50> __________________________________________________\n",
            "Generated : \n",
            "  negative-finance-tweet:BAC when she loses the 50 percent prize in November. For the campaign to reach voters, the campaign must successfully\n",
            "__________________________________________________ Generating new training data example 13/30 for prompt : <negative-finance-tweet:ANF weekly kumo. Note Vol> __________________________________________________\n",
            "Generated : \n",
            "  negative-finance-tweet:ANF weekly kumo. Note Volker's twitter handle with a simple, but very explicit, disclaimer that \"\n",
            "__________________________________________________ Generating new training data example 14/30 for prompt : <negative-finance-tweet:NEW POST: AAP Has Disappo> __________________________________________________\n",
            "Generated : \n",
            "  negative-finance-tweet:NEW POST: AAP Has Disappoiled Trump-Dossier, But Why Its Success Won't Happen Again Until\n",
            "__________________________________________________ Generating new training data example 15/30 for prompt : <negative-finance-tweet:AAP in time will retest p> __________________________________________________\n",
            "Generated : \n",
            "  negative-finance-tweet:AAP in time will retest p4.0, with the option for re-encode/recurse\n",
            "__________________________________________________ Generating new training data example 16/30 for prompt : <negative-finance-tweet:yum down premarket throug> __________________________________________________\n",
            "Generated : \n",
            "  negative-finance-tweet:yum down premarket throughen:no\n",
            "\n",
            "The problem was simple because most consumers were left with\n",
            "__________________________________________________ Generating new training data example 17/30 for prompt : <negative-finance-tweet:pnra missed this one yest> __________________________________________________\n",
            "Generated : \n",
            "  negative-finance-tweet:pnra missed this one yestry bit- of that, I thought someone else was probably going to have it\n",
            "__________________________________________________ Generating new training data example 18/30 for prompt : <negative-finance-tweet:Wipro March Quarter Profi> __________________________________________________\n",
            "Generated : \n",
            "  negative-finance-tweet:Wipro March Quarter Profi…\n",
            "\n",
            "We are thrilled to announce that our third quarter marketing strategies have been\n",
            "__________________________________________________ Generating new training data example 19/30 for prompt : <negative-finance-tweet:Cognizant Withdraws 2020 > __________________________________________________\n",
            "Generated : \n",
            "  negative-finance-tweet:Cognizant Withdraws 2020 Money-Threatening Policy: Bitch\n",
            "\n",
            "\"The Trump administration\n",
            "__________________________________________________ Generating new training data example 20/30 for prompt : <negative-finance-tweet:GIS - this is what all th> __________________________________________________\n",
            "Generated : \n",
            "  negative-finance-tweet:GIS - this is what all threesomes look like [31/12/2014, 6:25:\n",
            "__________________________________________________ Generating new training data example 21/30 for prompt : <negative-finance-tweet:PI 3Q positive0Q operatio> __________________________________________________\n",
            "Generated : \n",
            "  negative-finance-tweet:PI 3Q positive0Q operatio po (n/a) –P=P1 –F (P\n",
            "__________________________________________________ Generating new training data example 22/30 for prompt : <negative-finance-tweet:  SPY,IYT, FDX, PS A clos> __________________________________________________\n",
            "Generated : \n",
            "  negative-finance-tweet: SPY,IYT, FDX, PS A closer is no longer available at Costco, Costco doesn\n",
            "__________________________________________________ Generating new training data example 23/30 for prompt : <negative-finance-tweet:The most bearish thing I > __________________________________________________\n",
            "Generated : \n",
            "  negative-finance-tweet:The most bearish thing I have ever heard from Trump is that he doesn't like the people he is telling them\n",
            "__________________________________________________ Generating new training data example 24/30 for prompt : <negative-finance-tweet:Whiting, one of top drill> __________________________________________________\n",
            "Generated : \n",
            "  negative-finance-tweet:Whiting, one of top drillers was suspended from the drilling club in July 2012 after posting video of the company\n",
            "__________________________________________________ Generating new training data example 25/30 for prompt : <negative-finance-tweet:Most eading Stocks laggin> __________________________________________________\n",
            "Generated : \n",
            "  negative-finance-tweet:Most eading Stocks laggin'.\n",
            "\n",
            "By John Chapple\n",
            "\n",
            "\n",
            "\"With a lot of good trading\n",
            "__________________________________________________ Generating new training data example 26/30 for prompt : <negative-finance-tweet:user: AAP If TC announces> __________________________________________________\n",
            "Generated : \n",
            "  negative-finance-tweet:user: AAP If TC announces a vote on a budget, his Facebook page will have to be changed to say how\n",
            "__________________________________________________ Generating new training data example 27/30 for prompt : <negative-finance-tweet:For the past 8 years GOOG> __________________________________________________\n",
            "Generated : \n",
            "  negative-finance-tweet:For the past 8 years GOOGLE.COM has been a destination for individuals and small businesses using social media\n",
            "__________________________________________________ Generating new training data example 28/30 for prompt : <negative-finance-tweet:ssys if fundamentals matt> __________________________________________________\n",
            "Generated : \n",
            "  negative-finance-tweet:ssys if fundamentals mattheitner.com/wp-content/uploads/2017/11/10_\n",
            "__________________________________________________ Generating new training data example 29/30 for prompt : <negative-finance-tweet:Sensex falls over 200 poi> __________________________________________________\n",
            "Generated : \n",
            "  negative-finance-tweet:Sensex falls over 200 poi (around 3.5 cents) on average. (More: Sensex falls\n",
            "__________________________________________________ Generating new training data example 30/30 for prompt : <negative-finance-tweet:With thousands of flights> __________________________________________________\n",
            "Generated : \n",
            "  negative-finance-tweet:With thousands of flights cancelled for the week, the government also took to social media to express its dissatisfaction with the government\n",
            "__________________________________________________ Generating for label=positive No. 2/2 __________________________________________________\n",
            "__________________________________________________ Generating new training data example 1/30 for prompt : <positive-finance-tweet:SGMO monthly  > __________________________________________________\n",
            "Generated : \n",
            "  positive-finance-tweet:SGMO monthly-only, up to 20% off sales of stocks. No need to buy anything with a credit\n",
            "__________________________________________________ Generating new training data example 2/30 for prompt : <positive-finance-tweet:JNP - added to long> __________________________________________________\n",
            "Generated : \n",
            "  positive-finance-tweet:JNP - added to longlist - added @kimmytheshape\n",
            "__________________________________________________ Generating new training data example 3/30 for prompt : <positive-finance-tweet:long AAP 470.positive0 ho> __________________________________________________\n",
            "Generated : \n",
            "  positive-finance-tweet:long AAP 470.positive0 hoi.com/p/qzTqcQJK-F\n",
            "__________________________________________________ Generating new training data example 4/30 for prompt : <positive-finance-tweet:TWI like this long 24% sh> __________________________________________________\n",
            "Generated : \n",
            "  positive-finance-tweet:TWI like this long 24% shill story – #FakeNews and https://t.co/Q4\n",
            "__________________________________________________ Generating new training data example 5/30 for prompt : <positive-finance-tweet:DNDN waking up, almost GO> __________________________________________________\n",
            "Generated : \n",
            "  positive-finance-tweet:DNDN waking up, almost GOING to jail @JFreeman25. pic.twitter.com/\n",
            "__________________________________________________ Generating new training data example 6/30 for prompt : <positive-finance-tweet:OMEX sold some 3.52 buyin> __________________________________________________\n",
            "Generated : \n",
            "  positive-finance-tweet:OMEX sold some 3.52 buyin shares of UBS's shares on Friday, the company's second-\n",
            "__________________________________________________ Generating new training data example 7/30 for prompt : <positive-finance-tweet:AEE quiet refuge in the r> __________________________________________________\n",
            "Generated : \n",
            "  positive-finance-tweet:AEE quiet refuge in the rump when it comes to political news.\n",
            "\n",
            "\n",
            "He was one of 12 journalists\n",
            "__________________________________________________ Generating new training data example 8/30 for prompt : <positive-finance-tweet:GS exploded in the last h> __________________________________________________\n",
            "Generated : \n",
            "  positive-finance-tweet:GS exploded in the last hud, with @Gardner taking the #IWantTheDollarInK\n",
            "__________________________________________________ Generating new training data example 9/30 for prompt : <positive-finance-tweet:ong in BBBY after beating> __________________________________________________\n",
            "Generated : \n",
            "  positive-finance-tweet:ong in BBBY after beating the BMO Harris Bradley College football team and finishing 3-6\n",
            "\n",
            "As you\n",
            "__________________________________________________ Generating new training data example 10/30 for prompt : <positive-finance-tweet:FBHS - another #3WeeksTig> __________________________________________________\n",
            "Generated : \n",
            "  positive-finance-tweet:FBHS - another #3WeeksTigress moment. http://t.co/1dGm\n",
            "__________________________________________________ Generating new training data example 11/30 for prompt : <positive-finance-tweet:Buying back the goog i so> __________________________________________________\n",
            "Generated : \n",
            "  positive-finance-tweet:Buying back the goog i so wanted to buy a $$$back #finance pic.twitter.com\n",
            "__________________________________________________ Generating new training data example 12/30 for prompt : <positive-finance-tweet:AAP getting my attention.> __________________________________________________\n",
            "Generated : \n",
            "  positive-finance-tweet:AAP getting my attention. https://t.co/CkqFZkMuRv —\n",
            "__________________________________________________ Generating new training data example 13/30 for prompt : <positive-finance-tweet:OVI on watch list has vol> __________________________________________________\n",
            "Generated : \n",
            "  positive-finance-tweet:OVI on watch list has vol. 1 released!\n",
            "\n",
            "You have not specified any information.\n",
            "\n",
            "About\n",
            "__________________________________________________ Generating new training data example 14/30 for prompt : <positive-finance-tweet:SYNT full year 20positive> __________________________________________________\n",
            "Generated : \n",
            "  positive-finance-tweet:SYNT full year 20positive-f.com is up 8.6%.\n",
            "\n",
            "The report comes from an\n",
            "__________________________________________________ Generating new training data example 15/30 for prompt : <positive-finance-tweet:DDD needs to clear 35.80 > __________________________________________________\n",
            "Generated : \n",
            "  positive-finance-tweet:DDD needs to clear 35.80 % of her Twitter followers and she needs to hit 50,000 shares within\n",
            "__________________________________________________ Generating new training data example 16/30 for prompt : <positive-finance-tweet:GTAT closed above its 50S> __________________________________________________\n",
            "Generated : \n",
            "  positive-finance-tweet:GTAT closed above its 50S average after Thursday trading, pushing it to $34.31. For all the\n",
            "__________________________________________________ Generating new training data example 17/30 for prompt : <positive-finance-tweet:AET DVA nice job shaking > __________________________________________________\n",
            "Generated : \n",
            "  positive-finance-tweet:AET DVA nice job shaking this up, and letting the kids talk about how much they know about bitcoin as\n",
            "__________________________________________________ Generating new training data example 18/30 for prompt : <positive-finance-tweet:V just look at Visa, what> __________________________________________________\n",
            "Generated : \n",
            "  positive-finance-tweet:V just look at Visa, what are the worst credit card rates on all companies in the U.S.?\n",
            "\n",
            "__________________________________________________ Generating new training data example 19/30 for prompt : <positive-finance-tweet:DPZ nice intraday upward > __________________________________________________\n",
            "Generated : \n",
            "  positive-finance-tweet:DPZ nice intraday upward rise in the yield curve.\n",
            "\n",
            "There's no telling how the economy as\n",
            "__________________________________________________ Generating new training data example 20/30 for prompt : <positive-finance-tweet:SCSS usually likes to mak> __________________________________________________\n",
            "Generated : \n",
            "  positive-finance-tweet:SCSS usually likes to mak the line between right and left. This will include 'The Donald & the Roc\n",
            "__________________________________________________ Generating new training data example 21/30 for prompt : <positive-finance-tweet:NVDA - to answer your que> __________________________________________________\n",
            "Generated : \n",
            "  positive-finance-tweet:NVDA - to answer your queasy questions. :)\n",
            "\n",
            "\n",
            "Here is more information: http://t.co\n",
            "__________________________________________________ Generating new training data example 22/30 for prompt : <positive-finance-tweet:NKD stormed back in week > __________________________________________________\n",
            "Generated : \n",
            "  positive-finance-tweet:NKD stormed back in week 2-4, despite losing 13-0 to San Francisco . Now let's look\n",
            "__________________________________________________ Generating new training data example 23/30 for prompt : <positive-finance-tweet:DNDN Confirming nicely (c> __________________________________________________\n",
            "Generated : \n",
            "  positive-finance-tweet:DNDN Confirming nicely (c) January 10, 2016 at 10:53 AM   Bitch f—\n",
            "__________________________________________________ Generating new training data example 24/30 for prompt : <positive-finance-tweet:ACX Today's P reads very > __________________________________________________\n",
            "Generated : \n",
            "  positive-finance-tweet:ACX Today's P reads very similar to PQ-1 and P Q2, and in fact would require\n",
            "__________________________________________________ Generating new training data example 25/30 for prompt : <positive-finance-tweet:I need alka seltzer to sp> __________________________________________________\n",
            "Generated : \n",
            "  positive-finance-tweet:I need alka seltzer to spay my daughter. pic.twitter.com/k9L5\n",
            "__________________________________________________ Generating new training data example 26/30 for prompt : <positive-finance-tweet:user The day of the MA IP> __________________________________________________\n",
            "Generated : \n",
            "  positive-finance-tweet:user The day of the MA IP address: 71736.0.5.1:717764 (\n",
            "__________________________________________________ Generating new training data example 27/30 for prompt : <positive-finance-tweet:AMZN remember holding 269> __________________________________________________\n",
            "Generated : \n",
            "  positive-finance-tweet:AMZN remember holding 269 \"tax savings\" between 2005 and 2010 and that for the most part the gains came\n",
            "__________________________________________________ Generating new training data example 28/30 for prompt : <positive-finance-tweet:MCP either way, it's just> __________________________________________________\n",
            "Generated : \n",
            "  positive-finance-tweet:MCP either way, it's just like 'hey man, it was good to see you in Chicago, that\n",
            "__________________________________________________ Generating new training data example 29/30 for prompt : <positive-finance-tweet:KSS has been on a nice up> __________________________________________________\n",
            "Generated : \n",
            "  positive-finance-tweet:KSS has been on a nice upswing lately, especially since Trump came to power. We could see this trend\n",
            "__________________________________________________ Generating new training data example 30/30 for prompt : <positive-finance-tweet:NW - ong  positive9.75. T> __________________________________________________\n",
            "Generated : \n",
            "  positive-finance-tweet:NW - ong positive9.75. Ties from 2.10.9.15 that ended 4x\n",
            "Augmented Training Dataset Size = 120\n",
            "Training Augmented Classifierr\n",
            "sent_small_bert_L2_128 download started this may take some time.\n",
            "Approximate size to download 16.1 MB\n",
            "[OK!]\n",
            "Training Settings : Epochs=5, learn_rate=0.0005, batch_size=32, dropout=0.5\n",
            "sentence_detector_dl download started this may take some time.\n",
            "Approximate size to download 354.6 KB\n",
            "[OK!]\n",
            "__________________________________________________ Metrics on vanilla Train datataset with AUGMENTED Model __________________________________________________\n",
            "              precision    recall  f1-score   support\n",
            "\n",
            "    negative       0.89      0.65      0.75        48\n",
            "    positive       0.75      0.93      0.83        54\n",
            "\n",
            "    accuracy                           0.79       102\n",
            "   macro avg       0.82      0.79      0.79       102\n",
            "weighted avg       0.81      0.79      0.79       102\n",
            "\n",
            "____________________________________________________________________________________________________\n",
            "__________________________________________________ Metrics on vanilla Train datataset with VANILLA Model __________________________________________________\n",
            "              precision    recall  f1-score   support\n",
            "\n",
            "    negative       0.55      0.59      0.57       270\n",
            "    positive       0.55      0.51      0.53       270\n",
            "\n",
            "    accuracy                           0.55       540\n",
            "   macro avg       0.55      0.55      0.55       540\n",
            "weighted avg       0.55      0.55      0.55       540\n",
            "\n",
            "____________________________________________________________________________________________________\n",
            "__________________________________________________ Metrics on Test datataset with AUGMENTED Model __________________________________________________\n",
            "              precision    recall  f1-score   support\n",
            "\n",
            "    negative       0.94      0.67      0.78        87\n",
            "    positive       0.76      0.96      0.85        98\n",
            "\n",
            "    accuracy                           0.82       185\n",
            "   macro avg       0.85      0.81      0.81       185\n",
            "weighted avg       0.84      0.82      0.82       185\n",
            "\n",
            "____________________________________________________________________________________________________\n",
            "__________________________________________________ Metrics on Test datataset with VANILLA Model __________________________________________________\n",
            "              precision    recall  f1-score   support\n",
            "\n",
            "    negative       0.83      0.79      0.81        48\n",
            "    positive       0.82      0.85      0.84        54\n",
            "\n",
            "    accuracy                           0.82       102\n",
            "   macro avg       0.82      0.82      0.82       102\n",
            "weighted avg       0.82      0.82      0.82       102\n",
            "\n",
            "____________________________________________________________________________________________________\n",
            "__________________________________________________ Metrics on Augmented Train datataset with AUGMENTED Model __________________________________________________\n",
            "              precision    recall  f1-score   support\n",
            "\n",
            "    negative       0.54      0.56      0.55       270\n",
            "    positive       0.54      0.52      0.53       270\n",
            "\n",
            "    accuracy                           0.54       540\n",
            "   macro avg       0.54      0.54      0.54       540\n",
            "weighted avg       0.54      0.54      0.54       540\n",
            "\n",
            "______________________________________________________________________________________________________________________________________________________\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Data Augmentation on a Medical Dataset\n",
        "\n",
        "\n",
        "Using data from https://www.kaggle.com/datasets/draaslan/covid19-research-papers-dataset?select=papers.csv \n",
        "\n",
        "![img](https://law.mpg.de/wp-content/uploads/COVID-19-1920x1080.jpeg)\n"
      ],
      "metadata": {
        "id": "7OpctN_h1wbG",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Medical Paper Abstract Label Classification\n",
        "# https://www.kaggle.com/datasets/draaslan/covid19-research-papers-dataset?select=papers.csv \n",
        "\n",
        "! wget http://ckl-it.de/wp-content/uploads/2022/09/cleaned_data.csv\n",
        "\n",
        "train_df = pd.read_csv('cleaned_data.csv')\n",
        "\n",
        "train_df = train_df[['title','journal']]\n",
        "train_df.columns = ['text','y']\n",
        "train_df.y.value_counts().plot.barh(figsize=(20,16), title='Label Distribution')\n",
        "train_df.head(5)\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 1000
        },
        "id": "bdw0uIrC5JeU",
        "outputId": "4a84c329-f7ad-4766-f57d-f6370eef36d9",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "--2022-09-04 11:43:04--  http://ckl-it.de/wp-content/uploads/2022/09/cleaned_data.csv\n",
            "Resolving ckl-it.de (ckl-it.de)... 217.160.0.108, 2001:8d8:100f:f000::209\n",
            "Connecting to ckl-it.de (ckl-it.de)|217.160.0.108|:80... connected.\n",
            "HTTP request sent, awaiting response... 200 OK\n",
            "Length: 8042091 (7.7M) [text/csv]\n",
            "Saving to: ‘cleaned_data.csv’\n",
            "\n",
            "cleaned_data.csv    100%[===================>]   7.67M  42.2MB/s    in 0.2s    \n",
            "\n",
            "2022-09-04 11:43:04 (42.2 MB/s) - ‘cleaned_data.csv’ saved [8042091/8042091]\n",
            "\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "                                                                                                                           text  \\\n",
              "0          Factors Influencing Sleep Quality among Female Staff Nurses during the Early COVID-19 Pandemic in the United States.   \n",
              "1   Beyond the Pale: Dark Traits and Close Relations Influence Attitudes toward COVID-19 and the Rejection of Quarantine Rules.   \n",
              "2        COVID-19 Medical Vulnerability Indicators: A Predictive, Local Data Model for Equity in Public Health Decision Making.   \n",
              "3  Physical Activity and Perceived Physical Fitness during the COVID-19 Epidemic: A Population of 40- to 69-Year-Olds in Japan.   \n",
              "4    Lifestyle Effects on the Risk of Transmission of COVID-19 in the United States: Evaluation of Market Segmentation Systems.   \n",
              "\n",
              "                                                                   y  \n",
              "0  International journal of environmental research and public health  \n",
              "1  International journal of environmental research and public health  \n",
              "2  International journal of environmental research and public health  \n",
              "3  International journal of environmental research and public health  \n",
              "4  International journal of environmental research and public health  "
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-10e3f494-64cb-4d8d-9d8b-3bfae62131a4\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>text</th>\n",
              "      <th>y</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Factors Influencing Sleep Quality among Female Staff Nurses during the Early COVID-19 Pandemic in the United States.</td>\n",
              "      <td>International journal of environmental research and public health</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>Beyond the Pale: Dark Traits and Close Relations Influence Attitudes toward COVID-19 and the Rejection of Quarantine Rules.</td>\n",
              "      <td>International journal of environmental research and public health</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>COVID-19 Medical Vulnerability Indicators: A Predictive, Local Data Model for Equity in Public Health Decision Making.</td>\n",
              "      <td>International journal of environmental research and public health</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>Physical Activity and Perceived Physical Fitness during the COVID-19 Epidemic: A Population of 40- to 69-Year-Olds in Japan.</td>\n",
              "      <td>International journal of environmental research and public health</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>Lifestyle Effects on the Risk of Transmission of COVID-19 in the United States: Evaluation of Market Segmentation Systems.</td>\n",
              "      <td>International journal of environmental research and public health</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-10e3f494-64cb-4d8d-9d8b-3bfae62131a4')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "        \n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "      \n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-10e3f494-64cb-4d8d-9d8b-3bfae62131a4 button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-10e3f494-64cb-4d8d-9d8b-3bfae62131a4');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 52
        },
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "<Figure size 1440x1152 with 1 Axes>"
            ],
            "image/png": "iVBORw0KGgoAAAANSUhEUgAABbcAAAOVCAYAAABatd19AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAgAElEQVR4nOzdd7TlZX3v8c9XxmABsYCKRJ2oiF1UsKJB5aoRe8MSlWhizDW2xHi9poixYUgxamIs1wpiQVECXMWoKGKhSBfQG4EYLFGiKCJG4Xv/2L8TN4dz5pwZhhkeeL3WOmv2+e1fefZvn7MWvOeZZ1d3BwAAAAAARnKNzT0AAAAAAABYX+I2AAAAAADDEbcBAAAAABiOuA0AAAAAwHDEbQAAAAAAhiNuAwAAAAAwHHEbAADgClJVR1bV727qY6fj719VZ27o8Uuc7/9W1TOnx3tX1Rc24rmfVlVHbKzzAQBXD+I2AADACqrq7KraY3OPY0FV7VNVv6iqn0xfX6+qN1fV9gv7dPdR3b3TKs+1/0r7dfdvdfd7NsLY11ZVV9WauXMf0N0PubznBgCuXsRtAACAMX2wu7dOcsMkj01y0yTHzwfujaFm/L8jAHCl4z9QAAAANlBV3aCqDq2q71fVD6fHv75ot1tX1TFV9eOq+nhV3XDu+HtX1Rer6kdVdVJV7b6+Y+juX3T3aUn2SvL9JH88nXv3qvr3uWv9r6o6d5rpfWZVPbiqHpbk5Un2qqoLquqkad8jq+o1VXV0kguT3GqJZVJqmi1+flWdUVUPnnviUjPdF80O//z054+ma95n8TInVXXfqjp2OvexVXXfueeOrKpXVdXR02s5oqq2Xd/7BgCMT9wGAADYcNdI8q4kt0xyiyQ/S/LmRfs8I8mzkmyf5JdJ3pgkVbVDksOSvDqz2dcvSfKRqtpuQwbS3Rcn+XiS+y9+rqp2SvKHSXadZns/NMnZ3f2JJK/NbBb4Vt1917nDnp7kOUm2TnLOEpe8V5J/TbJtklck+eh8uF+HB0x/Xn+65pcWjfWGmd2XNya5UZK/TXJYVd1obrenJvmdJDdO8muZ3TsA4GpG3AYAANhA3X1ed3+kuy/s7p8keU2S31y02/u6+9Tu/mmSP0/ypKraIslvJzm8uw/v7ku6+1NJjkvy8MsxpG9nFsoXuzjJlknuUFXX7O6zu/tfVzjXu7v7tO7+ZXf/Yonn/yPJG6aZ4x9McmaSPS/H2BfsmeQb3f2+6doHJjkjySPn9nlXd3+9u3+W5ENJdt4I1wUABiNuAwAAbKCquk5VvbWqzqmqH2e25Mb1p3i94Ftzj89Jcs3MZjvfMskTpyVJflRVP0qyW2YzvDfUDkn+c/HG7v5/SV6UZJ8k/1FVH6iqm61wrm+t8Py53d1z35+TZKVzrsbNctmZ4udk9toWfHfu8YVJttoI1wUABiNuAwAAbLg/TrJTknt19/XyqyU3am6fm889vkWSXyT5QWbx+H3dff25r+t2974bMpDpQx8fmeSopZ7v7vd3926ZRfVO8vqFp5Y55XLbF+xQVfOv8xaZzRxPkp8muc7cczddj/N+exrjvFskOXeF4wCAqxlxGwAAYHWuWVXXmvtak9l61D/L7MMRb5jZ2tOL/XZV3aGqrpPkL5McNK2PvX+SR1bVQ6tqi+mcuy/xgZTrVFVrqur2SQ7MLCL/7RL77FRVD6qqLZNcNI35kunp7yVZO8Xx9XHjJC+oqmtW1ROT3D7J4dNzJyZ58vTcLkmeMHfc96dr32qZ8x6e5LZV9dTpte2V5A5JDl3P8QEAV3HiNgAAwOocnlkUXvjaJ8kbklw7s5nYX07yiSWOe1+Sd2e2lMa1krwgSbr7W0keneTlmQXfbyX5k6z+/9P2qqoLkpyf5JAk5yW5R3d/e4l9t0yy7zTO72YWpv/39NyHpz/Pq6qvrvLaSfKVJDtO53xNki
d093nTc3+e5NZJfpjklUnev3BQd1847X/0tBzLvedPOp3jEZnNij8vyUuTPKK7f7AeYwMArgbq0kukAQAAAADAlZ+Z2wAAAAAADEfcBgAAAABgOOI2AAAAAADDEbcBAAAAABiOuA0AAAAAwHDWbO4BAFyVbLvttr127drNPQwAAACAq4zjjz/+B9293eLt4jbARrR27docd9xxm3sYAAAAAFcZVXXOUtstSwIAAAAAwHDEbQAAAAAAhiNuAwAAAAAwHHEbAAAAAIDhiNsAAAAAAAxH3AYAAAAAYDjiNgAAAAAAwxG3AQAAAAAYjrgNAAAAAMBwxG0AAAAAAIYjbgMAAAAAMBxxGwAAAACA4YjbAAAAAAAMR9wGAAAAAGA44jYAAAAAAMMRtwEAAAAAGI64DQAAAADAcMRtAAAAAACGI24DAAAAADAccRsAAAAAgOGI2wAAAAAADEfcBgAAAABgOOI2AAAAAADDEbcBAAAAABiOuA0AAAAAwHDEbQAAAAAAhiNuAwAAAAAwHHEbAAAAAIDhiNsAAAAAAAxH3AYAAAAAYDjiNgAAAAAAwxG3AQAAAAAYjrgNAAAAAMBwxG0AAAAAAIYjbgMAAAAAMBxxGwAAAACA4YjbAAAAAAAMR9wGAAAAAGA44jYAAAAAAMMRtwEAAAAAGI64DQAAAADAcMRtAAAAAACGs2ZzDwDgquSUc8/P2pcdtrmHAQAAAFxNnb3vnpt7CJuMmdsAAAAAAAxH3AYAAAAAYDjiNgAAAAAAwxG3AQAAAAAYjrgNAAAAAMBwxG0AAAAAAIYjbgMAAAAAMBxxGwAAAACA4YjbAAAAAAAMR9wGAAAAAGA44jYAAAAAAMMRtwEAAAAAGI64DQAAAADAcMRtAAAAAACGI24DAAAAADAccRsAAAAAgOGI2wAAAAAADEfcBgAAAABgOOI2AAAAAADDEbcBAAAAABiOuA0AAAAAwHDEbQAAAAAAhiNuAwAAAAAwHHEbAAAAAIDhiNsAAAAAAAxH3AYAAAAAYDjiNgAAAAAAwxG3AQAAAAAYjrgNAAAAAMBwxG0AAAAAAIYjbgMAAAAAMBxxGwAAAACA4YjbAAAAAAAMR9wGAAAAAGA44jYAAAAAAMMRtwEAAAAAGI64DQAAAADAcMRtAAAAAACGI27DJlJVF1fViXNfazfCOV9UVdeZ+/7wqrr+5T3voms8qqpetjHPuaGqaveqOnQ9jzmyqna5osYEAAAAwOaxZnMPAK5GftbdOy/1RFVVkuruS9bznC9Ksn+SC5Okux++PgdX1RbdffG69unuQ5Icsp7jAgAAAIArlJnbsJlU1dqqOrOq3pvk1CQ3r6r9qurUqjqlqvaa9tt9mn18UFWdUVUH1MwLktwsyWer6rPTvmdX1bbT49+uqmOmWeJvraotpu0XVNXfVNVJSe5TVftW1deq6uSq+uslxrl3Vb15evzuqnpjVX2xqr5ZVU9Y5nUtjPP0adzXmZ671LWqauuqOquqrjk9f72F76vqNlX1L1V1UlV9tapuPV1iq8X3Yjr2wVV1wnTv3llVWy4xtqdMz59aVa+f2/7sqvr6dL/eXlVvXtfYNuwdBwAAAGBjErdh07n23JIkB0/bdkzyj919xyS7JNk5yV2T7JFkv6raftrvbpnN0r5DklsluV93vzHJt5M8sLsfOH+hqrp9kr2m/XZOcnGSp01PXzfJV7r7rklOT/LYJHfs7rskefUqXsf2SXZL8ogk+y6zz07T67p9kh8n+Z9VdaPF1+runyQ5Msme03FPTvLR7v5FkgOS/MM0zvsm+c5y96KqrpXk3Un26u47Z/avUv5g0T25WZLXJ3lQZvd516p6zLT9z5PcO8n9ktwuSVYYGwAAAACbmbgNm87Punvn6eux07ZzuvvL0+PdkhzY3Rd39/eSfC7JrtNzx3T3v0/LlpyYZO0K13pwknskObaqTpy+v9X03MVJPjI9Pj/JRUn+T1U9LtPyJiv4WHdf0t
1fS3KTZfb5VncfPT3ef3pty13rHUl+Z3r8O0neVVVbJ9mhuw9Oku6+qLsX9l/qXuyU5Kzu/vq0z3uSPGDRmHZNcmR3f7+7f5lZPH9Aknsm+Vx3/+cUrj88d8xlxrbUi62q51TVcVV13MUXnr/MLQEAAABgYxK3YfP66Sr3+/nc44uz8nr5leQ9czF9p+7eZ3ruooV1tqfIe88kB2U2E/sT6zmWWmafXvz9cteaIvjaqto9yRbdfep6XH8192KDrXZs3f227t6lu3fZ4jrbXFHDAQAAAGCOuA1XHkcl2auqtqiq7TKbVXzMCsf8JMnWS2z/dJInVNWNk6SqblhVt1y8U1VtlWSb7j48yYszWxJlY7hFVd1nevzUJF9Y4VrvTfL+TDOjpyVB/r2qHjONc8uFdbuXcWZmEfo20/dPz2zm+7xjkvxmVW07rT/+lGmfY6ftN6iqNUkev+i4S40NAAAAgCsHcRuuPA5OcnKSk5J8JslLu/u7KxzztiSfWPhAyQXTkiF/luSIqjo5yacyWyt7sa2THDrt84Ukf3T5XsJ/OzPJ86rq9CQ3SPKWFa51wLTfgXPbnp7kBdP+X0xy0+Uu1t0XZbZsyIer6pQklyT5p0X7fCfJy5J8NrN7fHx3f7y7z03y2szi99FJzs5sCZV1jQ0AAACAzay6F68eALDhqmptkkO7+07rccwTkjy6u59+RY1rhetv1d0XTDO3D07yzoX1vtd3bFtuv2Nv/8w3XIGjBQAAAFje2fvuubmHsNFV1fHdvcvi7VfYWrUAq1FVb0ryW0kevhmHsU9V7ZHkWkmOSPKxK9HYAAAAAFiCuA1sVN19dpJVz9ru7udfcaNZ9Rhessz2zT42AAAAAJZmzW0AAAAAAIYjbgMAAAAAMBxxGwAAAACA4YjbAAAAAAAMR9wGAAAAAGA44jYAAAAAAMMRtwEAAAAAGI64DQAAAADAcMRtAAAAAACGI24DAAAAADAccRsAAAAAgOGI2wAAAAAADEfcBgAAAABgOOI2AAAAAADDEbcBAAAAABiOuA0AAAAAwHDEbQAAAAAAhiNuAwAAAAAwHHEbAAAAAIDhiNsAAAAAAAxH3AYAAAAAYDjiNgAAAAAAwxG3AQAAAAAYjrgNAAAAAMBwxG0AAAAAAIYjbgMAAAAAMBxxGwAAAACA4YjbAAAAAAAMR9wGAAAAAGA44jYAAAAAAMMRtwEAAAAAGI64DQAAAADAcMRtAAAAAACGI24DAAAAADAccRsAAAAAgOGI2wAAAAAADEfcBgAAAABgOGs29wAArkruvMM2OW7fPTf3MAAAAACu8szcBgAAAABgOOI2AAAAAADDEbcBAAAAABiOuA0AAAAAwHDEbQAAAAAAhiNuAwAAAAAwHHEbAAAAAIDhiNsAAAAAAAxH3AYAAAAAYDjiNgAAAAAAwxG3AQAAAAAYjrgNAAAAAMBwxG0AAAAAAIYjbgMAAAAAMBxxGwAAAACA4YjbAAAAAAAMR9wGAAAAAGA44jYAAAAAAMMRtwEAAAAAGI64DQAAAADAcMRtAAAAAACGI24DAAAAADAccRsAAAAAgOGI2wAAAAAADEfcBgAAAABgOOI2AAAAAADDEbcBAAAAABiOuA0AAAAAwHDEbQAAAAAAhiNuAwAAAAAwHHEbAAAAAIDhiNsAAAAAAAxH3AYAAAAAYDjiNgAAAAAAwxG3AQAAAAAYjrgNAAAAAMBwxG0AAAAAAIYjbgMAAAAAMBxxGwAAAACA4YjbAAAAAAAMR9wGAAAAAGA44jYAAAAAAMMRtwEAAAAAGI64DQAAAADAcMRtAAAAAACGI24DAAAAADAccRsAAAAAgOGI2wAAAAAADEfcBgAAAABgOOI2AAAAAADDEbcBAAAAABiOuA0AAAAAwHDEbQAAAAAAhiNuAwAAAAAwHHEbAAAAAIDhiNsAAAAAAAxH3AYAAAAAYDjiNgAAAAAAwxG3AQAAAAAYjrgNAAAAAMBwxG0AAAAAAI
YjbgMAAAAAMBxxGwAAAACA4YjbAAAAAAAMR9wGAAAAAGA44jYAAAAAAMMRtwEAAAAAGI64DQAAAADAcMRtAAAAAACGI24DAAAAADAccRsAAAAAgOGI2wAAAAAADEfcBgAAAABgOOI2AAAAAADDEbcBAAAAABiOuA0AAAAAwHDEbQAAAAAAhiNuAwAAAAAwHHEbAAAAAIDhiNsAAAAAAAxH3AYAAAAAYDjiNgAAAAAAwxG3AQAAAAAYjrgNAAAAAMBwxG0AAAAAAIazZnMPAOCq5JRzz8/alx22uYcBAAAAXE2dve+em3sIm4yZ2wAAAAAADEfcBgAAAABgOOI2AAAAAADDEbcBAAAAABiOuA0AAAAAwHDEbQAAAAAAhiNuAwAAAAAwHHEbAAAAAIDhiNsAAAAAAAxH3AYAAAAAYDjiNgAAAAAAwxG3AQAAAAAYjrgNAAAAAMBwxG0AAAAAAIYjbgMAAAAAMBxxGwAAAACA4YjbAAAAAAAMR9wGAAAAAGA44jYAAAAAAMMRtwEAAAAAGI64DQAAAADAcMRtAAAAAACGI24DAAAAADAccRsAAAAAgOGI2wAAAAAADEfcBgAAAABgOOI2AAAAAADDEbcBAAAAABiOuA0AAAAAwHDEbQAAAAAAhiNuAwAAAAAwHHEbAAAAAIDhiNsAAAAAAAxH3AYAAAAAYDjiNgAAAAAAwxG3AQAAAAAYjrgNAAAAAMBwxG0AAAAAAIYjbgMAAAAAMBxxe0BVdWRV7TI9PruqTqmqk6vqc1V1yxWOvVlVHbSRxvHyucdrq+rUjXHeK1pVvaOq7rDCPo9ZaZ8rg6q6f1WdVlUnVtW1N/AcF2yksexeVYfOPb7v3HPvrqonbIzrTOfbpareuLHOBwAAAMB4xO2rhgd2912SHJnkz9a1Y3d/u7s3VmR8+cq7bDxVtWYjnGOL7v7d7v7aCrs+JslGjduXd/w1s/h39mlJXtfdO3f3z67oMayH3ZPcd6WdNlR3H9fdL7iizg8AAADAlZ+4vQlMs5rPmGavfr2qDqiqParq6Kr6RlXdc9rvulX1zqo6pqpOqKpHT9uvXVUfqKrTq+rgJMvN0P1Skh2mY3adZnNfazrvaVV1p/kZ1lX15aq649w4/3tG+Cpe075Jrj3NGD5g2rxFVb19utYRCzOJq+rWVfWJqjq+qo6qqtstcb59qup9VfWl6Z783rR99+mYQ5J8raq2qKr9qurY6fX9/tx+n6+qw6rqzKr6p4UQXFUXVNXfVNVJSe6zaOb7BVX1mqo6abofN5lmHD8qyX7T67v1orE+sapOnY75/LRtXeOaH/++VfW8Ra/7JdPjP5k7/pXTtrXT63lvklOT3Hzu2N9N8qQkr5p+pmoaw6k1m82/11JjWOb9vNQ9mLZtV1UfmcZ0bFXdb9p+z+l9OqGqvlhVOy0619okz03y4un+3X966gHT/t+sJWZxTz+nh03jOHVu/LtOx500/W5sXZeeJb7c783eVfXR6WfvG1X1V3PXelhVfXU656dXOM8dp20nTu/NjkvdQwAAAAA2rU01i5PkNkmemORZSY5N8tQku2UWUV+e2UzhP03yme5+VlVdP8kxVfUvSX4/yYXdffuqukuSry5zjYcl+ViSdPexU8x8dWYxfP/uPnUKjws+mFkcfUVVbZ9k++4+bv6EVXWzJO/o7ofPb+/ul1XVH3b3ztN+a5PsmOQp3f17VfWhJI9Psn+StyV5bnd/o6ruleQfkzxoifHfJcm9k1w3yQlVddi0/e5J7tTdZ1XVc5Kc3927VtWWSY6uqiOm/e6Z2Wzrc5J8Isnjkhw0ne8r3f3H01jnr3ndJF/u7j+d4ufvdferp3t3aHcvtYTLXyR5aHefO71PSfLsdYxrfvx3S/KGJP8wPfekJA+tqodM9++eSSrJIVX1gCT/Nm1/Znd/edF78I6q2m1hnFX1+CQ7J7lrkm2THLsQ3+fHsMTrucw9yOzn5u+T/F13f6GqbpHkk0
lun+SMJPfv7l9W1R5JXpvZe70wrrOr6p+SXNDdfz3d82cn2T6zn/nbJTlkem/mPSzJt7t7z+mYbarq1zL7Od1r+pm+XpLFM9SX+73JdD/uluTnSc6sqjcluSjJ25M8YHpPbrjCeZ6b5O+7+4BpPFsscQ8BAAAA2MTE7U3nrO4+JUmq6rQkn+7urqpTkqyd9nlIkkctzORNcq0kt0jygCRvTJLuPrmqTl507s9Oge6CJH8+t/0vMwvpFyVZagmHDyU5IskrMouslwm53f3tJA9fvH0dr/HE6fHxSdZW1VaZLU/x4bmovOUyx398WlrjZ1X12cxC74+SHDMXZR+S5C5zM3+3ySz+/te03zeTpKoOzCykHpTk4iQfWeaa/5Xk0Lkx/49VvM6jk7x7CvgfXeW4zkqS7j6hqm48/aXBdkl+2N3fqqoXTuc4YTp+q+n4f0tyzuKwvYzdkhzY3Rcn+V5VfS7Jrkl+nEvfw8WWuwd7JLnD3Pt2ven93CbJe6YZzJ3kmqsYW5J8rLsvyWwG+02WeP6UJH9TVa/PLNgfVVV3TvKd7j42Sbr7x8ll/oJiud+bZPZ7dv50zNeS3DLJDZJ8fu49+c8VzvOlJH9aVb+e5KPd/Y3FA5/+0uU5SbLF9bZb5e0AAAAA4PIQtzedn889vmTu+0vyq/ehkjy+u8+cP3BRyFvKAzOLwAckeWWSP5q23yizSHrNzELdT+cPmmYenzfNBt8rsxmql8f8a7w4sxnj10jyo4UZ3ivoZb6fH3cleX53f3J+x6rafR3HXzQF36X8orsX9rs4q/id6O7nTjPQ90xyfFXdY4Vx/XTRKT6c5AlJbprZrOSF1/W67n7rouPXLnH8hljXOZa7B9dIcu/uvmjRmN6c5LPd/dhpfEeucgzzPx+X+aHu7q9X1d0z+8uUV0/LhRy8ivMu93tzr1z2Z3Jd7++S50lyelV9JbP3+/Cq+v3u/syisb8ts3+hkC2333HxzyEAAAAAVwBrbl+5fDLJ82uq2dMSFkny+cyWMUlV3Smz5Tsupbt/meRFSZ4xt8zCWzObyX1Aktcvc80PJnlpkm26e/GM8JX8oqrWOWt3mml7VlU9cRp/VdVdl9n90TVbI/xGmX0g4bFL7PPJJH+wcN2qum1VXXd67p5V9Rs1W2t7ryRfWM/XM+8nSbZe6omqunV3f6W7/yLJ9zNbB3td41rsg0menFng/vDc63rWNDM6VbVDVd14Pcd8VJK9arb+93aZzfg/Zj3PMe+IJM9f+KaqFv6CYpsk506P917m2GXv33Km2ewXdvf+SfbLbCmVM5NsX1W7TvtsXZf9UMzlfm+W8+XM1v/+jWn/hd+XJc9TVbdK8s3ufmOSj2eJ3z8AAAAANj1x+8rlVZnNsj55WrrkVdP2tyTZqqpOz2ypkeOXOri7v5PkwCTPq6pnZDYj9/1J9k2ya1Uttc71QZmF1g8tdc6qullVHb7MeN82jfWAZZ5f8LQkz67ZBzqeluTRy+x3cpLPZhYfXzUtibLYOzL7UMSv1uyDMd+aX83GPTbJm5OcnuSsrG7W73I+kORPavbBgrde9Nx+NfvAxlOTfDHJSSuM61K6+7TMwu+503uW7j4iyfuTfGlaquagrGcczuz1njyN5zNJXtrd313Pc8x7QZJdavYhil/Lr2b2/1WS11XVCVl+JvQ/J3lsXfoDJVdy58zWuT4xs6VyXt3d/5XZX1S8afr5+VRm/wph3nK/N0vq7u9ntoTIR6dzLsyeX+48T0py6jSuOyV57ypfDwAAAABXoPrVagSw+VTVPpn7AMINOH73JC/p7kdszHHB+tpy+x17+2e+YXMPAwAAALiaOnvfPTf3EDa6qjq+u3dZvN3MbQAAAAAAhuMDJblS6O59LufxR2b1H2wIAAAAAAzOzG0AAAAAAIYjbgMAAAAAMBxxGwAAAACA4YjbAAAAAAAMR9wGAAAAAGA44jYAAAAAAMMRtwEAAAAAGI64DQAAAADAcM
RtAAAAAACGI24DAAAAADAccRsAAAAAgOGI2wAAAAAADEfcBgAAAABgOOI2AAAAAADDEbcBAAAAABiOuA0AAAAAwHDEbQAAAAAAhiNuAwAAAAAwHHEbAAAAAIDhiNsAAAAAAAxH3AYAAAAAYDjiNgAAAAAAwxG3AQAAAAAYjrgNAAAAAMBwxG0AAAAAAIYjbgMAAAAAMBxxGwAAAACA4YjbAAAAAAAMR9wGAAAAAGA44jYAAAAAAMMRtwEAAAAAGI64DQAAAADAcMRtAAAAAACGI24DAAAAADAccRsAAAAAgOGI2wAAAAAADGfN5h4AwFXJnXfYJsftu+fmHgYAAADAVZ6Z2wAAAAAADEfcBgAAAABgOOI2AAAAAADDEbcBAAAAABiOuA0AAAAAwHDEbQAAAAAAhiNuAwAAAAAwHHEbAAAAAIDhiNsAAAAAAAxH3AYAAAAAYDjiNgAAAAAAwxG3AQAAAAAYjrgNAAAAAMBwxG0AAAAAAIYjbgMAAAAAMBxxGwAAAACA4YjbAAAAAAAMR9wGAAAAAGA44jYAAAAAAMMRtwEAAAAAGI64DQAAAADAcMRtAAAAAACGI24DAAAAADAccRsAAAAAgOGI2wAAAAAADEfcBgAAAABgOOI2AAAAAADDEbcBAAAAABiOuA0AAAAAwHDEbQAAAAAAhiNuAwAAAAAwHHEbAAAAAIDhiNsAAAAAAAxH3AYAAAAAYDjiNgAAAAAAwxG3AQAAAAAYjrgNAAAAAMBwxG0AAAAAAIYjbgMAAAAAMBxxGwAAAACA4YjbAAAAAAAMR9wGAAAAAGA44jYAAAAAAMMRtwEAAAAAGI64DQAAAADAcMRtAAAAAACGI24DAAAAADAccRsAAAAAgOGI2wAAAAAADEfcBgAAAABgOOI2AAAAAADDEbcBAAAAABiOuA0AAAAAwHDEbQAAAAAAhiNuAwAAAAAwHHEbAAAAAIDhiNsAAAAAAAxH3AYAAAAAYDjiNgAAAAAAwxG3AQAAAAAYjrgNAAAAAMBwxG0AAAAAAIYjbgMAAAAAMBxxGwAAAACA4YjbAAAAAAAMR9wGAAAAAGA44jYAAAAAAMMRtwEAAAAAGI64DQAAAADAcMRtAAAAAACGI24DAAAAADAccRsAAAreIaIAACAASURBVAAAgOGI2wAAAAAADEfcBgAAAABgOOI2AAAAAADDEbcBAAAAABiOuA0AAAAAwHDEbQAAAAAAhiNuAwAAAAAwHHEbAAAAAIDhiNsAAAAAAAxH3AYAAAAAYDjiNgAAAAAAwxG3AQAAAAAYjrgNAAAAAMBwxG0AAAAAAIazZnMPAOCq5JRzz8/alx22uYcBAAAAXE2dve+em3sIm4yZ2wAAAAAADEfcBgAAAABgOOI2AAAAAADDEbcBAAAAABiOuA0AAAAAwHDEbQAAAAAAhiNuAwAAAAAwHHEbAAAAAIDhiNsAAAAAAAxH3AYAAAAAYDjiNgAAAAAAwxG3AQAAAAAYjrgNAAAAAMBwxG0AAAAAAIYjbgMAAAAAMBxxGwAAAACA4YjbAAAAAAAMR9wGAAAAAGA44jYAAAAAAMMRtwEAAAAAGI64DQAAAADAcMRtAAAAAACGI24DAAAAADAccRsAAAAAgOGI2wAAAAAADEfcBgAAAABgOOI2AAAAAADDEbcBAAAAABiOuA0AAAAAwHDEbQAAAAAAhiNuAwAAAAAwHHEbAAAAAIDhiNsAAAAAAAxH3AYAAAAAYDjiNgAAAAAAwxG3AQAAAAAYjrgNAAAAAMBwxG0AAAAAAIYjbgMAAAAAMBxxG1ahqi7YTNfdvaoOXc9jDqyqk6vqxZtiXFX1qKp62Qae5+yq2nY99v/iep5/ve8fAAAAAGNYs7kHAFcXVbVFd198BV/jpkl27e7bXJHXmdfdhyQ5ZBNd676Lt1XVmu7+5aa4PgAAAABXHmZuwyrVzH5VdWpVnVJVe03bLzU7uKreXFV7T4/PrqrXV9VXkzxx+v6VVf
XV6Ry3m/a7Z1V9qapOqKovVtVOK4zlWlX1rukcJ1TVA6enjkiyQ1WdWFX3X3TMu6vqLVX15ar65jTud1bV6VX17rn9HjKN5atV9eGq2mra/rCqOmN6LY+b23/vqnrz9PgmVXVwVZ00fd132v6xqjq+qk6rques8NqeW1X7LXP+C+bu+VFVdUiSr63jfsyf94bTOE6e7sFdpu3bVdWnprG9o6rOqaptq+ovq+pFc8e/pqpeuK6xAwAAALDpiNuweo9LsnOSuybZI8l+VbX9Ko47r7vv3t0fmL7/QXffPclbkrxk2nZGkvt3992S/EWS165wzucl6e6+c5KnJHlPVV0ryaOS/Gt379zdRy1x3A2S3CfJizObbf13Se6Y5M5VtfO0RMifJdljGuNxSf5oOvfbkzwyyT2S3HSZcb0xyee6+65J7p7ktGn7s7r7Hkl2SfKCqrrROl7bR5I8du77vZJ8YIn97p7khd1923Xcj3mvTHJCd98lycuTvHfa/ookn+nuOyY5KMktpu3vTPKMJKmqayR5cpL91zFuAAAAADYhy5LA6u2W5MBpaZHvVdXnkuya5McrHPfBRd9/dPrz+PxqBvQ2mQXZHZN0kmuuYixvSpLuPqOqzkly21WM5Z+7u6vqlCTf6+5TkqSqTkuyNsmvJ7lDkqOrKkl+LcmXktwuyVnd/Y1p//2TLDUD+0GZgvB0n86ftr+gqhaC9c2T7JjkvKUG2N3fn2aW3zvJN6ZrH73Ersd091kr3I95uyV5/LTPZ6rqRlV1vWn7Y6ftn6iqH06Pz66q86rqbkluklkYX3LM02z05yTJFtfbbqldAAAAANjIxG24/H6ZS/8riMUzhn+66PufT39enF/9Dr4qyWe7+7FVtTbJkRt3iJe59iVzjxe+XzON6VPd/ZT5g6pq5w29YFXtntlM9/t094VVdWQue48W+0CSJ2U2o/3g7u4l9ll8X68I70iyd2Yz1d+53E7d/bYkb0uSLbffcamxAgAAALCRWZYEVu+oJHtV1RZVtV2SByQ5Jsk5Se5QVVtW1fWTPHgDzr1NknOnx3uvcixPS5Kqum1mS2mcuQHXXezLSe5XVbeZzn3d6fxnJFlbVbee9nvKMsd/OskfTMduUVXbZPbafjiF7dslufcqxnFwkkdP11lqSZLFVnM/5vfZPbPlYX6c2azwJ03bH5LZ0i3z43hYZjP0P7mKcQAAAACwiYjbsIKqWpPZLOeDk5yc5KQkn0ny0u7+bnd/K8mHkpw6/XnCBlzmr5K8rqpOyOr+RcU/JrnGtLzIB5Ps3d0/X+GYFXX39zOL6wdW1cmZliTp7osyW3bjsOkDJf9jmVO8MMkDp3Edn9kSJ59IsqaqTk+yb2YBfaVx/DDJ6Ulu2d3HrGLoq7kf+/z/9u492ta6rvf45yvIRVRMJdviBU3MQSrILc1LoGQlpnXE0CjBLI8Nh7fyeMgainks0kxTvBwwRc2jeD+kpRKCmorC5iI3yY5QeL+CNxSE3/nj+S2YLNZae629195r/zav1xh77Dmf+cz5/J4517MmvPczfzPJfn2/jk1yZF/+oiSPrKoLkjw+ydeSfL+P4+okpyV5Z59mBQAAAICtRC38aX9gTlXtneSE1tqBaz0WVl9V7Zjk2tbaT6vqQUle11rbp992iyRnJ3n83HzjG7Ljuj3buiNfufkGDAAAALCEy449dK2HsOqqan1rbf/5y825DUuoqqcleWaSZ6/1WNhs7pbknT1kX53kj5KkqvZK8oFMc34vK2wDAAAAsOWI27CE1trrk7x+rcfB5tPD9QMWWH5Rkntu+REBAAAAsBzm3AYAAAAAYDjiNgAAAAAAwxG3AQAAAAAYjrgNAAAAAMBwxG0AAAAAAIYjbgMAAAAAMBxxGwAAAACA4YjbAAAAAAAMR9wGAAAAAGA44jYAAAAAAMMRtwEAAAAAGI64DQAAAADAcMRtAAAAAACGI24DAAAAADAccRsAAAAAgOGI2wAAAA
AADEfcBgAAAABgOOI2AAAAAADDEbcBAAAAABiOuA0AAAAAwHDEbQAAAAAAhiNuAwAAAAAwHHEbAAAAAIDhiNsAAAAAAAxH3AYAAAAAYDjiNgAAAAAAwxG3AQAAAAAYjrgNAAAAAMBwxG0AAAAAAIYjbgMAAAAAMBxxGwAAAACA4YjbAAAAAAAMR9wGAAAAAGA44jYAAAAAAMMRtwEAAAAAGI64DQAAAADAcLZf6wEAbEvut/uuOevYQ9d6GAAAAADbPGduAwAAAAAwHHEbAAAAAIDhiNsAAAAAAAxH3AYAAAAAYDjiNgAAAAAAwxG3AQAAAAAYjrgNAAAAAMBwxG0AAAAAAIYjbgMAAAAAMBxxGwAAAACA4YjbAAAAAAAMR9wGAAAAAGA44jYAAAAAAMMRtwEAAAAAGI64DQAAAADAcMRtAAAAAACGI24DAAAAADAccRsAAAAAgOGI2wAAAAAADEfcBgAAAABgOOI2AAAAAADDEbcBAAAAABiOuA0AAAAAwHDEbQAAAAAAhiNuAwAAAAAwHHEbAAAAAIDhiNsAAAAAAAxH3AYAAAAAYDjiNgAAAAAAwxG3AQAAAAAYjrgNAAAAAMBwxG0AAAAAAIYjbgMAAAAAMBxxGwAAAACA4YjbAAAAAAAMR9wGAAAAAGA44jYAAAAAAMMRtwEAAAAAGI64DQAAAADAcMRtAAAAAACGI24DAAAAADAccRsAAAAAgOGI2wAAAAAADEfcBgAAAABgOOI2AAAAAADDEbcBAAAAABiOuA0AAAAAwHDEbQAAAAAAhiNuAwAAAAAwHHEbAAAAAIDhiNsAAAAAAAxH3AYAAAAAYDjiNgAAAAAAwxG3AQAAAAAYjrgNAAAAAMBwxG0AAAAAAIYjbgMAAAAAMBxxGwAAAACA4YjbAAAAAAAMR9wGAAAAAGA44jYAAAAAAMMRtwEAAAAAGI64DQAAAADAcMRtAAAAAACGI24DAAAAADAccRsAAAAAgOGI2wAAAAAADEfcBgAAAABgOOI2AAAAAADDEbcBAAAAABiOuA0AAAAAwHDEbQAAAAAAhiNuAwAAAAAwHHEbAAAAAIDhiNsAAAAAAAxH3AYAAAAAYDjiNgAAAAAAwxG3AQAAAAAYjrgNAAAAAMBwxG0AAAAAAIYjbgMAAAAAMBxxGwAAAACA4YjbAAAAAAAMR9wGAAAAAGA44jYAAAAAAMMRtwEAAAAAGM72az0AgG3J+V++Mnsc/cG1HgYAAABwM3XZsYeu9RC2GGduAwAAAAAwHHEbAAAAAIDhiNsAAAAAAAxH3AYAAAAAYDjiNgAAAAAAwxG3AQAAAAAYjrgNAAAAAMBwxG0AAAAAAIYjbgMAAAAAMBxxGwAAAACA4YjbAAAAAAAMR9wGAAAAAGA44jYAAAAAAMMRtwEAAAAAGI64DQAAAADAcMRtAAAAAACGI24DAAAAADAccRsAAAAAgOGI2wAAAAAADEfcBgAAAABgOOI2AAAAAADDEbcBAAAAABiOuA0AAAAAwHDEbQAAAAAAhiNuAwAAAAAwHHEbAAAAAIDhiNsAAAAAAAxH3AYAAAAAYDjiNgAAAAAAwxG3AQAAAAAYjrgNAAAAAMBwxG0AAAAAAIYjbgMAAAAAMBxxGwAAAACA4YjbAAAAAAAMR9wGAAAAAGA44jYAAAAAAMMRtwEAAAAAGI64DQAAAADAcMRtWIGquraqzq2q86rq7Kr65b58j6pqVfW/Zta9Y1VdU1XH9evHVNVzF3ncZ1fVk2auP7eqPt+3debcbVV1elXt3y//c1XdbgPj/cuqOmQj9nOPqrpgpfdbC1V11NxzvAqPteB+V9UOVfXxqtp+NbYDAAAAwKYTt2Flrmqt7dNa2zvJnyX565nbLk1y6Mz1xye5cEMP2IPpHyT5P/3605L8apIDW2v7JHlEkpp/v9bao1prVyz12K21F7TW/nVDY9gYWzr0rmVYbq1dneTUJIev1RgAAAAAuDFxGzbebZN8d+
b6j5JcPHdmdaYQ+s5lPM7Dk5zdWvtpv/78JH/cWvtekrTWvtdae/P8O1XVZf3s8D2q6uKqOqGqLqyqj1TVzn2dE6vqsH75gKr6VD/r/LNVdZt+30/0s9CvPxN9MVV1UF//5CQXVdV2VfWyfnb556rqv/f11vUznc+tqguq6qF9+SOr6tN9W++qqlv35S/oj3FBVR1fVdWXn15Vr6yqs5I8a6F96EO7c1V9qKq+UFUvXWTs+1XVx6pqfVV9uKrWzSw/r6rOS/L0JXb//UmOWOr5AQAAAGDLEbdhZXbuwfbzSd6Q5MXzbn9HkidU1V2TXJvkK8t4zAcnWZ8kVXXbJLdprX1xhePaM8lrWmu/mOSKJI+bvbGqdkhyUpJn9bPOD0lyVZJvJPnV1tq+mWL8q5axrX3749w7yVOSXNlaOyDJAUn+qKrukeR3k3y4n3m+d5Jzq+qOSf4iySF9e2cl+ZP+mMe11g5ord03yc5JHj2zvR1aa/snefUi+5Ak+/Tx3y/J4f35n93/W/b7H9Za2y/JG5O8pN/8piTP6I+5lAv6PgIAAACwFTB/LKzMVT3YpqoelOQtVXXfmds/lCl4fz1TiF2OdUku3sRxXdpaO7dfXp9kj3m3/0KSr7bWzkyms8GTpKp2SXJcVe2TKcbfexnb+mxr7dJ++ZFJ7j93dniSXTOF9jOTvLFH5fe31s6tql9JsleST/YTs3dI8ul+v4Or6nlJbpXk9pmmc/mnftvc87jYPiTJqa21K/v1i5LcPcnl8/b/vklO6etvl+Srfc7y27XWPt7Xe2uS31hop1tr11bV1VV1m9ba92dvq6qnJnlqkmx3292Weu4AAAAAWCXiNmyk1tqn+9nIu80su7qq1if500wh9zHLeKirkuzU7/+9qvpBVd1zhWdv/2Tm8rWZzn5ejudkCvF7Z/okx4+XcZ8fzlyuTGc9f3j+SlX1sExzkJ9YVX+XaQqXU1prT5y33k5JXptk/9ba5VV1TPrzscD2FjN//+f/bqskF7bWHjRv20t+IecCdswCz1Fr7fgkxyfJjuv2bCt8TAAAAAA2gmlJYCNV1X0ynQH87Xk3vTzJ/2ytfWeZD3VxknvNXP/rJK/pU5Skqm5dVU/axOFekmRdVR3QH/M2/Qsad810NvR1SX4/0/6sxIeT/HE/QztVde+q2qWq7p7k6621EzJN37JvkjOSPLiq7tXX3aWq7p0bQva3+hzch91kK0vvw3JckmS3frZ9quqWVfWL/Qs5r6iqh/T1Fp1Tu6rukORbrbVrlrlNAAAAADYjZ27DyuxcVXPTf1SSI/t0Fdev0Fq7MNO0GvNtnxufYTznXzJNhzHndUluneTMqromyTWZgvlG62eUH57k1f3LJq/KNGf1a5O8p8fzD2V5Z0nPekOmKVDO7l8C+c0kv5XkoCT/o4//B0me1Fr7ZlUdleTtVbVjv/9ftNb+vapOyDSn9dcyTWmykn1Y7v4fluRVVbVrptfilZlepydnmkKlJfnI3H2q6s5J3tBae1RfdHCSDy5newAAAABsftWaT9DDllBV70tyQmvtnxe57XmttS9s+ZGxHFX13iRHt9b+fan1dly3Z1t35Cu30KgAAAAAbuyyYw9d6yGsuqpa31rbf/5y05LAFlBV5ye5LjNnBs9zdKYvlmQrVFU7ZPpizCXDNgAAAABbjmlJYAtord1vA7dfkmleaLZCrbWrk7xlrccBAAAAwA2cuQ0AAAAAwHDEbQAAAAAAhiNuAwAAAAAwHHEbAAAAAIDhiNsAAAAAAAxH3AYAAAAAYDjiNgAAAAAAwxG3AQAAAAAYjrgNAAAAAMBwxG0AAAAAAIYjbgMAAAAAMBxxGwAAAACA4YjbAAAAAAAMR9wGAAAAAGA44jYAAAAAAMMRtwEAAAAAGI64DQAAAADAcMRtAAAAAACGI24DAAAAADAccRsAAAAAgOGI2wAAAAAADEfcBgAAAABgOOI2AAAAAADDEbcBAAAAABiOuA
0AAAAAwHDEbQAAAAAAhiNuAwAAAAAwHHEbAAAAAIDhiNsAAAAAAAxH3AYAAAAAYDjiNgAAAAAAwxG3AQAAAAAYjrgNAAAAAMBwxG0AAAAAAIYjbgMAAAAAMBxxGwAAAACA4Wy/1gMA2Jbcb/ddc9axh671MAAAAAC2ec7cBgAAAABgOOI2AAAAAADDEbcBAAAAABiOuA0AAAAAwHDEbQAAAAAAhiNuAwAAAAAwHHEbAAAAAIDhiNsAAAAAAAxH3AYAAAAAYDjiNgAAAAAAwxG3AQAAAAAYjrgNAAAAAMBwxG0AAAAAAIYjbgMAAAAAMBxxGwAAAACA4YjbAAAAAAAMR9wGAAAAAGA44jYAAAAAAMMRtwEAAAAAGI64DQAAAADAcMRtAAAAAACGI24DAAAAADAccRsAAAAAgOGI2wAAAAAADEfcBgAAAABgOOI2AAAAAADDEbcBAAAAABiOuA0AAAAAwHDEbQAAAAAAhiNuAwAAAAAwHHEbAAAAAIDhiNsAAAAAAAxH3AYAAAAAYDjiNgAAAAAAwxG3AQAAAAAYjrgNAAAAAMBwxG0AAAAAAIYjbgMAAAAAMBxxGwAAAACA4YjbAAAAAAAMR9wGAAAAAGA44jYAAAAAAMMRtwEAAAAAGI64DQAAAADAcMRtAAAAAACGI24DAAAAADAccRsAAAAAgOGI2wAAAAAADEfcBgAAAABgOOI2AAAAAADDEbcBAAAAABiOuA0AAAAAwHDEbQAAAAAAhiNuAwAAAAAwHHEbAAAAAIDhiNsAAAAAAAxH3AYAAAAAYDjiNgAAAAAAwxG3AQAAAAAYjrgNAAAAAMBwxG0AAAAAAIYjbgMAAAAAMBxxGwAAAACA4YjbAAAAAAAMR9wGAAAAAGA44jYAAAAAAMMRtwEAAAAAGI64DQAAAADAcMRtAAAAAACGI24DAAAAADAccRsAAAAAgOGI2wAAAAAADEfcBgAAAABgOOI2AAAAAADDEbcBAAAAABiOuA0AAAAAwHDEbQAAAAAAhiNuAwAAAAAwHHEbAAAAAIDhiNsAAAAAAAxH3AYAAAAAYDjiNgAAAAAAwxG3AQAAAAAYjrgNAAAAAMBwxG0AAAAAAIYjbgMAAAAAMJzt13oAANuS8798ZfY4+oNrPQwAAADgZuqyYw9d6yFsMc7cBgAAAABgOOI2AAAAAADDEbcBAAAAABiOuA0AAAAAwHDEbQAAAAAAhiNuAwAAAAAwHHEbAAAAAIDhiNsAAAAAAAxH3AYAAAAAYDjiNgAAAAAAwxG3AQAAAAAYjrgNAAAAAMBwxG0AAAAAAIYjbgMAAAAAMBxxGwAAAACA4YjbAAAAAAAMR9wGAAAAAGA44jYAAAAAAMMRtwEAAAAAGI64DQAAAADAcMRtAAAAAACGI24DAAAAADAccRsAAAAAgOGI2wAAAAAADEfcBgAAAABgOOI2AAAAAADDEbcBAAAAABiOuA0AAAAAwHDEbQAAAAAAhiNuAwAAAAAwHHEbAAAAAIDhiNsAAAAAAAxH3AYAAAAAYDjiNgAAAAAAwxG3AQAAAAAYjrgNAAAAAMBwxG0AAAAAAIYjbgMAAAAAMBxxGwAAAACA4WwwblfVD5axzrOr6larM6Qlt7NPVT1q5vpjqurozbCdBfe5qj612ttarqo6qKo+sML7vL2qPldVz9lMY7pzVb17czz2aqqq5y9zvcuq6o6bezybqqr2qKoLtsB2jqmq567SYy12TJ1YVYf1y2+oqr2W+XhHVdVxqzS22THc6HfZcn7/AQAAALA2VuvM7WcnWVHcrqrtNmI7+yS5Pm631k5urR27EY+zUVprv7waj7OR+77SbfxckgNaa/dvrb1ic2yjtfaV1tphC2x7+82xvU2wrLi9HFvitZvZ1tb2PG5WrbU/bK1dtMbDWPHvMgAAAADWxrLjdj9z+PSqendVfb6q3laTZya5c5LTquq0vu4jq+rTVXV2Vb2rqm7dl19WVX9TVWcneXy//q
K+3vlVdZ++3oH9/udU1aeq6heqaockf5nk8Ko6t6oOnz17s5/N+tF+pvKpVXW3vvzEqnpVf5wvzpyheeu+3ty2H7uM5+AH/e+qqpdV1QX9vofPPEcfmFn/uKo6alP3fQNj2qmq3tQf45yqOrjf9JEku/fn6qHz7rNbVb2nqs7sfx7clx9TVW/sr/MX+2ubqjq2qp4+c/9jquq5s2cQ99fi5Kr6aJJTq+r2VfX+/nqcUVX338A29ug/VydW1b/3n69DquqTVfWFqjqwr7dLv/9n+/4+dmb7762qD/X1Xzo39iQ79+fhbX3Z+6tqfVVdWFVPXc7rXlUvr6rzkjyoqn6vb//cqvrfVbVd/3PizM/Ec/p9f76PaX1VfWLmdf7NqvpM34d/rao7zTw/b62qTyZ5a1XdqareV1Xn9T9z/8CyXVWd0PfhI1W18wLjXmobN3kN+m1/3p//f0uy4M9e38/XV9VZfd1Hz7wGx82s94GqOmjm+iv6eE+tqt0WeNzTq2r/fvnXazo2zquqUxd5ae48//Xu913s988L+s/7BVV1fFXVvO3f5HdZX/6SPo4z5p5DAAAAANbeSs/cfkCmMxv3SnLPJA9urb0qyVeSHNxaO7imaR3+IskhrbV9k5yV5E9mHuPbrbV9W2vv6Ne/1dd7XZK5KRA+n+ShrbUHJHlBkr9qrV3dL5/UWtuntXbSvLG9OsmbW2v3T/K2JK+auW1dkockeXSSuTO9f5zkt/u2D07y8vmxawn/LdNZ5HsnOSTJy6pq3TLut1H7voHHfHqS1lq7X5InJnlzVe2U5DFJ/l9/rj4x7z5/n+QVrbUDkjwuyRtmbrtPkl9LcmCSF1bVLZOclOR3Ztb5nb5svn2THNZa+5UkL0pyTn89np/kLRvYRpLcK8nL++33SfK7mV635+aGs6//PMlHW2sHZnrdXlZVu/Tb9klyeJL7ZfpHkLu21o5OclV/Ho7o6/1Ba22/JPsneWZV3WGBfZm1S5LPtNb2TvLtvo0Ht9b2SXJtkiP6tndvrd23vxZv6vc9Pskz+vaem+S1ffm/JXlgf53fkeR5M9vbK9Px88RMP8cf69veN8mFfZ09k7ymtfaLSa7I9DrOt9Q2bvIaVNV+SZ6QGz4hccASz8ke/b6HJnl9/5lbyi5Jzurj/ViSFy62Yg/fJyR5XN/vxy+y6k1e7w38/jmutXZAa+2+SXbO9PvgevN/l82M+4w+jo8n+aMN7CcAAAAAW8hKpz34bGvtS0lSVedmClz/Nm+dB2aKc5/srXiHJJ+euX1+FH1v/3t9pmicJLtmirR7JmlJbpkNe9DM/d+a5KUzt72/tXZdkotmzrysJH9VVQ9Lcl2S3ZPcKcnXlrGthyR5e2vt2iRfr6qPZQqB39vA/TbHvj8kU9hPa+3zVfWfSe69gbEckmSvmZZ/27mzW5N8sLX2kyQ/qapvJLlTa+2cqvrZqrpzkt2SfLe1dnlV7THvcU9prX1nZlyP6+P6aFXdoapuu9g2+vJLW2vnJ0lVXZjk1NZaq6rzM/2sJckjkzymbpgLeqckd+uXT22tXdnvf1GSuye5fIH9f2ZV/Xa/fNdMofjbSzxf1yZ5T7/8iCT7JTmzP387J/lGkn9Kcs+qenWSDyb5SH9OfznJu2ae6x3733dJclL/R5Edklw6s72TW2tX9csPT/KkJOk/b1dW1c/05+rcvs76medn1lLbWOg1eGiS97XWfpQkVXXyEs/JO/sx9YWq+mKmWL6U63LDz/8/5oaf/YU8MMnHW2uXJsnMz9R8C73et8viv38OrqrnZZp25PaZ/qHgnzYw7quTzH0aY32SX11opZo+AfDUJNnutjc5KR0AAACAzWClcfsnM5evXeT+lSlyPnGRx/jhIo85+3gvTnJaa+23e0A9fYXjnG923HOV8YhMoXa/1to1VXVZplC6KX6aG58NP//x1mLfF3KLTGf0/nh2YY+Bi73G70pyWJKfy8JnbSc33b/FLLaN2e
XXzVy/bmadynRG7yXzxv5LSzzu7HoHZYr7D2qt/aiqTs+GX/cf97A8t/03t9b+bIHH3jvT2dBPy3R2+7OTPs2DBQAAC5JJREFUXNHP8J7v1Un+rrV2ch/TMTO3Led5nL+vN5mWZAPbWM6xvJS2wPUN/fwvdf+NsdA+LPj7p59Z/tok+/d/mDlmA+Obc01rbW6siz5PrbXjM52lnx3X7bka+wYAAADABqzWF0p+P8lt+uUzkjy4qu6VXD9H8r1X+Hi7Jvlyv3zUItuZ71OZplRIpnA9fyqOhbbxjR62D8501udyfSLTNAjb9SkUHpbks0n+M9MZ0TtW1e0yneW7Uovt+1JjOSJJ+vN8tySXLHmPaT7uZ8xdqaqF4ut8J2V6fg/LFLpXMq6DMk3BsqEz25fjw0meMTeFTFU9YBn3uWZm6pNdM515/qOa5r9+4Aq3f2qSw6rqZ/v2b19Vd+/TYdyitfaeTNNi7Nv399Kqenxft3oAnxvH3Ot85Aa298f9/ttV1a4rGOtytzHn40l+q6p2rqrbJPnNJdZ9fFXdoqp+PtMURZckuSzJPn35XTNNWzLnFpl+dpJpupn5n/iYdUaSh1XVPZLpOV7G2Gfvu9Dvn7mQ/a1+Rv1Nvgi1W+p3DAAAAABbkdWK28cn+VBVndZa+2amKPv2qvpcpikBNjRlwXwvTfLXVXVObnym5GmZ4vG51b/EccYzkjy5b/P3kzxrA9t4W5L9+5QXT8o01/WGzJ2R+b4kn0tyXpKPJnlea+1rrbXLk7wzyQX973OW8ZjzLbbvi3ltklv0/TgpyVF9uomlPDPTvn+uT+fwtA1tpLV2Yabo9+XW2leXMa5jkuzXX49js7y4uhwvzjRVy+f61CUvXsZ9ju/rvy3Jh5JsX1UX93GdsZKNt9YuyhSvP9L37ZRMc7rvnuT0Pl3PPyaZO7P7iCRPqenLKC9MMvfFpcdkmq5kfZJvLbHJZ2WaTuP8TNNi7LWC4S53G3P7dnamn6HzkvxLkjOXWP2/Mv2Dzr8keVr/FMAnM019clGmucLPnln/h0kOrOkLSB+e6cthFxvHNzNN8fHe/rwt9kmBxe57VOb9/mmtXZFpHu8LMv0DyWL7dv3vsuVuEwAAAIC1UTd84p6l9C8dPLu1tpIzvGGbU1UnJvlAa+3daz2WrdGO6/Zs64585VoPAwAAALiZuuzYQ9d6CKuuqta31vafv3y1ztzepvUvUvx0kr9d67EAAAAAALDyL5G7WWqtfSXJSucNh21Sa+2otR4DAAAAADhzGwAAAACA4YjbAAAAAAAMR9wGAAAAAGA44jYAAAAAAMMRtwEAAAAAGI64DQAAAADAcMRtAAAAAACGI24DAAAAADAccRsAAAAAgOGI2wAAAAAADEfcBgAAAABgOOI2AAAAAADDEbcBAAAAABiOuA0AAAAAwHDEbQAAAAAAhiNuAwAAAAAwHHEbAAAAAIDhiNsAAAAAAAxH3AYAAAAAYDjiNgAAAAAAwxG3AQAAAAAYjrgNAAAAAMBwxG0AAAAAAIYjbgMAAAAAMBxxGwAAAACA4YjbAAAAAAAMR9wGAAAAAGA44jYAAAAAAMMRtwEAAAAAGI64DQAAAADAcMRtAAAAAACGI24DAAAAADAccRsAAAAAgOGI2wAAAAAADEfcBgAAAABgOOI2AAAAAADD2X6tBwCwLbnf7rvmrGMPXethAAAAAGzznLkNAAAAAMBwxG0AAAAAAIYjbgMAAAAAMBxxGwAAAACA4YjbAAAAAAAMR9wGAAAAAGA44jYAAAAAAMMRtwEAAAAAGI64DQAAAADAcMRtAAAAAACGI24DAAAAADAccRsAAAAAgOGI2wAAAAAADEfcBgAAAABgOOI2AAAAAADDEbcBAAAAABiOuA0AAAAAwHDEbQAAAAAAhiNuAwAAAAAwHHEbAAAAAIDhiNsAAAAAAAxH3AYAAAAAYDjiNgAAAAAAwxG3AQAAAAAYjrgNAAAAAMBwxG0AAA
AAAIYjbgMAAAAAMBxxGwAAAACA4YjbAAAAAAAMR9wGAAAAAGA44jYAAAAAAMMRtwEAAAAAGI64DQAAAADAcMRtAAAAAACGI24DAAAAADAccRsAAAAAgOGI2wAAAAAADEfcBgAAAABgOOI2AAAAAADDEbcBAAAAABiOuA0AAAAAwHDEbQAAAAAAhiNuAwAAAAAwnGqtrfUYALYZVfX9JJes9ThgG3HHJN9a60HANsQxBavH8QSryzEFq2tbPKbu3lrbbf7C7ddiJADbsEtaa/uv9SBgW1BVZzmeYPU4pmD1OJ5gdTmmYHXdnI4p05IAAAAAADAccRsAAAAAgOGI2wCr6/i1HgBsQxxPsLocU7B6HE+wuhxTsLpuNseUL5QEAAAAAGA4ztwGAAAAAGA44jbAKqiqX6+qS6rqP6rq6LUeD4ygqu5aVadV1UVVdWFVPasvv31VnVJVX+h//0xfXlX1qn6cfa6q9l3bPYCtT1VtV1XnVNUH+vV7VNVn+nFzUlXt0Jfv2K//R799j7UcN2yNqup2VfXuqvp8VV1cVQ/yHgUbp6qe0/9774KqentV7eQ9Cpavqt5YVd+oqgtmlq34Pamqjuzrf6GqjlyLfVlt4jbAJqqq7ZK8JslvJNkryROraq+1HRUM4adJ/rS1tleSByZ5ej92jk5yamttzySn9uvJdIzt2f88NcnrtvyQYav3rCQXz1z/mySvaK3dK8l3kzylL39Kku/25a/o6wE39vdJPtRau0+SvTMdW96jYIWqavckz0yyf2vtvkm2S/KEeI+ClTgxya/PW7ai96Squn2SFyb5pSQHJnnhXBAfmbgNsOkOTPIfrbUvttauTvKOJI9d4zHBVq+19tXW2tn98vczRYPdMx0/b+6rvTnJb/XLj03yljY5I8ntqmrdFh42bLWq6i5JDk3yhn69kjw8ybv7KvOPp7nj7N1JHtHXB5JU1a5JHpbkH5KktXZ1a+2KeI+CjbV9kp2ravskt0ry1XiPgmVrrX08yXfmLV7pe9KvJTmltfad1tp3k5ySmwbz4YjbAJtu9ySXz1z/Ul8GLFP/uOkDknwmyZ1aa1/tN30tyZ36ZccaLO2VSZ6X5Lp+/Q5Jrmit/bRfnz1mrj+e+u1X9vWByT2SfDPJm/pUP2+oql3iPQpWrLX25SR/m+S/MkXtK5Osj/co2FQrfU/aJt+rxG0AYE1V1a2TvCfJs1tr35u9rbXWkrQ1GRgMpKoeneQbrbX1az0W2EZsn2TfJK9rrT0gyQ9zw8e9k3iPguXq0x48NtM/Gt05yS7ZBs4Wha3Jzfk9SdwG2HRfTnLXmet36cuADaiqW2YK229rrb23L/763Ee5+9/f6Msda7C4Byd5TFVdlml6rIdnmi/4dv0j4MmNj5nrj6d++65Jvr0lBwxbuS8l+VJr7TP9+rszxW7vUbByhyS5tLX2zdbaNUnem+l9y3sUbJqVvidtk+9V4jbApjszyZ792753yPTlKCev8Zhgq9fnTvyHJBe31v5u5qaTk8x9c/eRSf7vzPIn9W//fmCSK2c+hgc3a621P2ut3aW1tkem96GPttaOSHJaksP6avOPp7nj7LC+/s3ybB9YSGvta0kur6pf6IsekeSieI+CjfFfSR5YVbfq//03dzx5j4JNs9L3pA8neWRV/Uz/RMUj+7Khld8PAJuuqh6Vaa7T7ZK8sbX2kjUeEmz1quohST6R5PzcMEfw8zPNu/3OJHdL8p9Jfqe19p3+P0PHZfoY64+SPLm1dtYWHzhs5arqoCTPba09uqrumelM7tsnOSfJ77XWflJVOyV5a6a57r+T5AmttS+u1Zhha1RV+2T6gtYdknwxyZMznSDmPQpWqKpelOTwJD/N9H70h5nm+vUeBctQVW9PclCSOyb5epIXJnl/VvieVFV/kOn/uZLkJa21N23J/dgcxG0AAAAAAIZjWhIAAAAAAIYjbgMAAAAAMBxxGwAAAACA4YjbAAAAAAAMR9wGAA
AAAGA44jYAAAAAAMMRtwEAAAAAGI64DQAAAADAcP4/JlS/WARZwxgAAAAASUVORK5CYII=\n"
          },
          "metadata": {
            "needs_background": "light"
          }
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# Let's test augmentation on a medical dataset\n",
        "num_aug_per_label=25\n",
        "n_per_class_total = 250\n",
        "epochs=20\n",
        "prompt_label_prefix=' reports:'\n",
        "\n",
        "# Let's make the generated text a bit longer, since the (already long) label counts toward MaxOutputLength\n",
        "gpt2_pipe['gpt2'].setMaxOutputLength(50)\n",
        "\n",
        "no_aug_fitted_classifier, aug_fitted_classifier = compare_vanilla_and_augmented_training(\n",
        "  base_dataset=train_df,\n",
        "  generator_model=gpt2_pipe,\n",
        "  embeddings_to_use=embeddings_to_use,\n",
        "  n_per_class_total=n_per_class_total,\n",
        "  train_test_frac=train_test_frac,\n",
        "  label_col=label_col,\n",
        "  prompt_label_prefix=prompt_label_prefix,\n",
        "  slice_len=slice_len,\n",
        "  num_aug_per_label=num_aug_per_label,\n",
        "  epochs=epochs,\n",
        "  learn_rate=learn_rate,\n",
        "  batch_size=batch_size,\n",
        "  droput=droput)\n",
        "\n",
        "# Some classes improve with augmentation; tuning the hyperparameters or generating more data may improve results further"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "1CrX37DzFghj",
        "outputId": "73e4e87d-30b4-463d-dfc5-cb4c4e901141",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "__________________________________________________ Starting Data Augmentation Experiment __________________________________________________\n",
            "Train Dataset Size = 125\n",
            "Test Dataset Size = 1125\n",
            "Training Vanilla Classifier\n",
            "sent_small_bert_L2_128 download started this may take some time.\n",
            "Approximate size to download 16.1 MB\n",
            "[OK!]\n",
            "Training Settings : Epochs=20, learn_rate=0.0005, batch_size=32, dropout=0.5\n",
            "sentence_detector_dl download started this may take some time.\n",
            "Approximate size to download 354.6 KB\n",
            "[OK!]\n",
            "__________________________________________________ Metrics on Train datataset with Vanilla Model __________________________________________________\n",
            "                                                                   precision    recall  f1-score   support\n",
            "\n",
            "                                      BMJ (Clinical research ed.)       0.49      0.76      0.60       225\n",
            "                                          Frontiers in psychology       0.44      0.82      0.58       225\n",
            "International journal of environmental research and public health       0.00      0.00      0.00       225\n",
            "                                      Journal of medical virology       0.50      0.74      0.59       225\n",
            "                medRxiv : the preprint server for health sciences       0.59      0.06      0.11       225\n",
            "\n",
            "                                                         accuracy                           0.48      1125\n",
            "                                                        macro avg       0.40      0.48      0.37      1125\n",
            "                                                     weighted avg       0.40      0.48      0.37      1125\n",
            "\n",
            "__________________________________________________ Metrics on Test Dataset with Vanilla Model __________________________________________________\n",
            "                                                                   precision    recall  f1-score   support\n",
            "\n",
            "                                      BMJ (Clinical research ed.)       0.70      0.97      0.81        36\n",
            "                                          Frontiers in psychology       0.53      0.93      0.68        27\n",
            "International journal of environmental research and public health       0.00      0.00      0.00        28\n",
            "                                      Journal of medical virology       0.54      0.81      0.65        31\n",
            "                medRxiv : the preprint server for health sciences       1.00      0.19      0.32        26\n",
            "\n",
            "                                                         accuracy                           0.61       148\n",
            "                                                        macro avg       0.56      0.58      0.49       148\n",
            "                                                     weighted avg       0.56      0.61      0.51       148\n",
            "\n",
            "____________________________________________________________________________________________________\n",
            "__________________________________________________ Generating for label=BMJ (Clinical research ed.) No. 1/5 __________________________________________________\n",
            "__________________________________________________ Generating new training data example 1/25 for prompt : <BMJ (Clinical research ed.) reports:Thailand tightens covid-1> __________________________________________________\n",
            "Generated : \n",
            "  BMJ (Clinical research ed.) reports:Thailand tightens covid-1-type and covid2-type-associated DNA repeats to increase diversity (14) and has been shown to have an effect (15, 16). As a\n",
            "__________________________________________________ Generating new training data example 2/25 for prompt : <BMJ (Clinical research ed.) reports:Covid-19: Where are we on> __________________________________________________\n",
            "Generated : \n",
            "  BMJ (Clinical research ed.) reports:Covid-19: Where are we on this?Vitamins and Minerals, J. D. N. , Zuokai, M., Chang, F., Wang, D.,\n",
            "__________________________________________________ Generating new training data example 3/25 for prompt : <BMJ (Clinical research ed.) reports:Covid-19: Plans to share > __________________________________________________\n",
            "Generated : \n",
            "  BMJ (Clinical research ed.) reports:Covid-19: Plans to share genetic material that is needed to generate human cells for a human embryo. This will be part of the first-ever collaboration between scientists at the University of North Carolina\n",
            "__________________________________________________ Generating new training data example 4/25 for prompt : <BMJ (Clinical research ed.) reports:Covid-19: Cancer care at > __________________________________________________\n",
            "Generated : \n",
            "  BMJ (Clinical research ed.) reports:Covid-19: Cancer care at an early stage , Clinical Biologist , 2015 , vol. 40 5 (pg. 439 - 49 ) , vol.(pg. Jansson VW O\n",
            "__________________________________________________ Generating new training data example 5/25 for prompt : <BMJ (Clinical research ed.) reports:Assessing risk for health> __________________________________________________\n",
            "Generated : \n",
            "  BMJ (Clinical research ed.) reports:Assessing risk for health at risk, 2014.\n",
            "\n",
            "A study in the United Kingdom from 2013 found 24 per cent of people who were exposed to asbestos in their homes (22) and 21 per cent\n",
            "__________________________________________________ Generating new training data example 6/25 for prompt : <BMJ (Clinical research ed.) reports:Prophylactic anticoagulat> __________________________________________________\n",
            "Generated : \n",
            "  BMJ (Clinical research ed.) reports:Prophylactic anticoagulatetics were achieved by using T3-tetrahydrocannabinol (THC), a class of hormone responsible for the natural release of cannabinoids and other\n",
            "__________________________________________________ Generating new training data example 7/25 for prompt : <BMJ (Clinical research ed.) reports:Doctors faced with imposs> __________________________________________________\n",
            "Generated : \n",
            "  BMJ (Clinical research ed.) reports:Doctors faced with impossibilities: doctors and patients who could not treat or treat their patients, but did have a way to assist in their treatment... Patients were 'more well-rounded' and 'more\n",
            "__________________________________________________ Generating new training data example 8/25 for prompt : <BMJ (Clinical research ed.) reports:Covid-19: US suspends Joh> __________________________________________________\n",
            "Generated : \n",
            "  BMJ (Clinical research ed.) reports:Covid-19: US suspends Johansen's visa to Norway from September 28, pending his official departure from Norway on August 19, the European Commission announced on August 26.\n",
            "\n",
            "\n",
            "Johan\n",
            "__________________________________________________ Generating new training data example 9/25 for prompt : <BMJ (Clinical research ed.) reports:Covid-19: Ethnicity vacci> __________________________________________________\n",
            "Generated : \n",
            "  BMJ (Clinical research ed.) reports:Covid-19: Ethnicity vaccius, Nd.H.Hod. , Nd., Nd,, 2.02, 1.20. [11] H.B\n",
            "__________________________________________________ Generating new training data example 10/25 for prompt : <BMJ (Clinical research ed.) reports:Covid-19: Brazil's spiral> __________________________________________________\n",
            "Generated : \n",
            "  BMJ (Clinical research ed.) reports:Covid-19: Brazil's spiral and coiled web and the web of the web (Clínica da Fonseca, 2002), see also:OECD MOU: \"C\n",
            "__________________________________________________ Generating new training data example 11/25 for prompt : <BMJ (Clinical research ed.) reports:Co-producing the covid-19> __________________________________________________\n",
            "Generated : \n",
            "  BMJ (Clinical research ed.) reports:Co-producing the covid-1917 (Firm-type and non-fluorourishing) protein is the largest component of the human diet and the highest-risk component, by a dose\n",
            "__________________________________________________ Generating new training data example 12/25 for prompt : <BMJ (Clinical research ed.) reports:Covid-19: NHS will keep o> __________________________________________________\n",
            "Generated : \n",
            "  BMJ (Clinical research ed.) reports:Covid-19: NHS will keep ointment in the NHS and offer it for low-income patients, even after these individuals are discharged from the hospital for treatment that does not meet clinical and\n",
            "__________________________________________________ Generating new training data example 13/25 for prompt : <BMJ (Clinical research ed.) reports:Covid-19: Female doctors > __________________________________________________\n",
            "Generated : \n",
            "  BMJ (Clinical research ed.) reports:Covid-19: Female doctors are found to be significantly more likely to be overweight than other groups for the third time\n",
            "\n",
            "4-10.9 years old female doctors may be overweight, but\n",
            "__________________________________________________ Generating new training data example 14/25 for prompt : <BMJ (Clinical research ed.) reports:Covid-19: Military coup i> __________________________________________________\n",
            "Generated : \n",
            "  BMJ (Clinical research ed.) reports:Covid-19: Military coup i.e. mass murders to kill civilian and military targets with chemical weapons, etc. A report to the Committee on Chemical Weapons is presented to the Chemical Weapons Convention\n",
            "__________________________________________________ Generating new training data example 15/25 for prompt : <BMJ (Clinical research ed.) reports:Covid-19: Unusual blood c> __________________________________________________\n",
            "Generated : \n",
            "  BMJ (Clinical research ed.) reports:Covid-19: Unusual blood cations during normal skin biopsies of various bacterial strains. (1)(a)-Covidium hydroxide is found on the surface of the skin\n",
            "__________________________________________________ Generating new training data example 16/25 for prompt : <BMJ (Clinical research ed.) reports:Covid-19: NHS England ple> __________________________________________________\n",
            "Generated : \n",
            "  BMJ (Clinical research ed.) reports:Covid-19: NHS England pleiotropic and paracaripotential injuries/\n",
            "\n",
            "Hospitals in London Hospitals report (10/23/06) that \"there\n",
            "__________________________________________________ Generating new training data example 17/25 for prompt : <BMJ (Clinical research ed.) reports:Covid-19: India should st> __________________________________________________\n",
            "Generated : \n",
            "  BMJ (Clinical research ed.) reports:Covid-19: India should stupefy and kill the bats, but it's not necessary to do it that way.\n",
            "\n",
            "I can't imagine it happening, but if you can't\n",
            "__________________________________________________ Generating new training data example 18/25 for prompt : <BMJ (Clinical research ed.) reports:Covid:19: Ethnic minority> __________________________________________________\n",
            "Generated : \n",
            "  BMJ (Clinical research ed.) reports:Covid:19: Ethnic minority\n",
            "\n",
            "Pompeii:1Col.:18 C, 1D;4D.CODESAIMI/FIC.TEX.\n",
            "__________________________________________________ Generating new training data example 19/25 for prompt : <BMJ (Clinical research ed.) reports:Covid-19: WHO says labora> __________________________________________________\n",
            "Generated : \n",
            "  BMJ (Clinical research ed.) reports:Covid-19: WHO says labora, cadaver, and zebra. http://www.fae.dia.nih.gov/clinicreport/id/2788\n",
            "__________________________________________________ Generating new training data example 20/25 for prompt : <BMJ (Clinical research ed.) reports:Covid-19: Mass testing at> __________________________________________________\n",
            "Generated : \n",
            "  BMJ (Clinical research ed.) reports:Covid-19: Mass testing at 10°C (8 nm-1).\n",
            "\n",
            "Figs 2 D and 2 E show the results of a series of foci of small-m\n",
            "__________________________________________________ Generating new training data example 21/25 for prompt : <BMJ (Clinical research ed.) reports:Covid-19: public health l> __________________________________________________\n",
            "Generated : \n",
            "  BMJ (Clinical research ed.) reports:Covid-19: public health lumbar spine is the most common outcome among patients with a progressive spine fracture, which affects up to 20% of the general population.2,3 On the\n",
            "__________________________________________________ Generating new training data example 22/25 for prompt : <BMJ (Clinical research ed.) reports:Covid-19: One in seven pe> __________________________________________________\n",
            "Generated : \n",
            "  BMJ (Clinical research ed.) reports:Covid-19: One in seven peptic ulcers in children was resistant to covidone (P.J. et al., 2005).1-3 The C. albicans\n",
            "__________________________________________________ Generating new training data example 23/25 for prompt : <BMJ (Clinical research ed.) reports:Covid-19: Colleges publis> __________________________________________________\n",
            "Generated : \n",
            "  BMJ (Clinical research ed.) reports:Covid-19: Colleges publis-2, 2002.2\n",
            "\n",
            "Stern JW, Schuster RM. The effects on serum lipid. Am J Hum Genet 1998;55:\n",
            "__________________________________________________ Generating new training data example 24/25 for prompt : <BMJ (Clinical research ed.) reports:Optimal dosing interval f> __________________________________________________\n",
            "Generated : \n",
            "  BMJ (Clinical research ed.) reports:Optimal dosing interval f/d. The median tetracycline tolerance dose of diazepam at dose 1.03 mg/day at week 7 of therapy is approximately 2.5 mg\n",
            "__________________________________________________ Generating new training data example 25/25 for prompt : <BMJ (Clinical research ed.) reports:Assessing risk for health> __________________________________________________\n",
            "Generated : \n",
            "  BMJ (Clinical research ed.) reports:Assessing risk for health outcomes after obesity. JAMA 1990;284:30-45.\n",
            "\n",
            "\"The importance of exercise to control blood sugar levels is well established\"\n",
            "\n",
            "Ivan Gom\n",
            "__________________________________________________ Generating for label=Journal of medical virology No. 2/5 __________________________________________________\n",
            "__________________________________________________ Generating new training data example 1/25 for prompt : <Journal of medical virology reports:Vitamin D axis status and> __________________________________________________\n",
            "Generated : \n",
            "  Journal of medical virology reports:Vitamin D axis status and mortality rates and dietary fiber intakes as risk factors and the effect of chronic vitamin D deficiency on bone health.\n",
            "\n",
            "The paper was published online December 4, 2016.\n",
            " and\n",
            "\n",
            "L\n",
            "__________________________________________________ Generating new training data example 2/25 for prompt : <Journal of medical virology reports:Remdesivir for the treatm> __________________________________________________\n",
            "Generated : \n",
            "  Journal of medical virology reports:Remdesivir for the treatmigration of dengue to West Africa is currently underway. The disease has been diagnosed in about 100 people [including about 90 people from China and South Korea] who have received treatment\n",
            "__________________________________________________ Generating new training data example 3/25 for prompt : <Journal of medical virology reports:Could we predict the prog> __________________________________________________\n",
            "Generated : \n",
            "  Journal of medical virology reports:Could we predict the prognosis for cancer with our high-resolution spectroscopy models? To answer this question we created a large-scale version of the UMR project which uses high-end spectroscopic models to\n",
            "__________________________________________________ Generating new training data example 4/25 for prompt : <Journal of medical virology reports:Herpes zoster emergence f> __________________________________________________\n",
            "Generated : \n",
            "  Journal of medical virology reports:Herpes zoster emergence furs the virus among healthy adults, and, on the basis of findings in animal models, is an important factor in establishing treatment options. (13)\n",
            "__________________________________________________ Generating new training data example 5/25 for prompt : <Journal of medical virology reports:Peripheral biomarkers' pa> __________________________________________________\n",
            "Generated : \n",
            "  Journal of medical virology reports:Peripheral biomarkers' paisei-peripheral, peripherals (b). Bioscience and Biology, 28, 895-901. PMID: 9248601 PMID : 10\n",
            "__________________________________________________ Generating new training data example 6/25 for prompt : <Journal of medical virology reports:Biochemical rationale for> __________________________________________________\n",
            "Generated : \n",
            "  Journal of medical virology reports:Biochemical rationale for the protective effect of diclofenac on breast tissue and breast cancer:A review [link].\n",
            "__________________________________________________ Generating new training data example 7/25 for prompt : <Journal of medical virology reports:Vaccine design based on 1> __________________________________________________\n",
            "Generated : \n",
            "  Journal of medical virology reports:Vaccine design based on 1:1 combination has been shown not to support safety of vaccines in infants who receive them. It is now clear that in the absence of evidence of risk factors with no clinical value, the\n",
            "__________________________________________________ Generating new training data example 8/25 for prompt : <Journal of medical virology reports:Dengue amidst COVID-19 in> __________________________________________________\n",
            "Generated : \n",
            "  Journal of medical virology reports:Dengue amidst COVID-19 in the Philippines at the end of August:It appears that the death of at least 3,600 people has forced the government of Ferdinand Marcos. It has caused political unrest in the\n",
            "__________________________________________________ Generating new training data example 9/25 for prompt : <Journal of medical virology reports:Cocirculation of COVID-19> __________________________________________________\n",
            "Generated : \n",
            "  Journal of medical virology reports:Cocirculation of COVID-19-2 by A. J. P. McFarland and M. D. Dey , Nature Publishing Group . June 24 , 2007 , vol. 368 (pg.\n",
            "__________________________________________________ Generating new training data example 10/25 for prompt : <Journal of medical virology reports:Surveillance and re-posit> __________________________________________________\n",
            "Generated : \n",
            "  Journal of medical virology reports:Surveillance and re-positories of the piperus gagei:\n",
            "\n",
            "\"Researchers have long known that the vascular structure of the Piperus Gagei increases with time… But the researchers\n",
            "__________________________________________________ Generating new training data example 11/25 for prompt : <Journal of medical virology reports:Hypoalbuminemia in patien> __________________________________________________\n",
            "Generated : \n",
            "  Journal of medical virology reports:Hypoalbuminemia in patienos with pulmonary hypertension. On the basis of this report, a trial on titeri (a naturally occurring polysaccharide), a prophylactic for prophagic\n",
            "__________________________________________________ Generating new training data example 12/25 for prompt : <Journal of medical virology reports:Targeting the viral-entry> __________________________________________________\n",
            "Generated : \n",
            "  Journal of medical virology reports:Targeting the viral-entry space at the root of cancer or at the end of a pathogenesis stage is complex. The research suggests that the virus is present at the entry/tidal state, perhaps through the process\n",
            "__________________________________________________ Generating new training data example 13/25 for prompt : <Journal of medical virology reports:Evaluation of seven comme> __________________________________________________\n",
            "Generated : \n",
            "  Journal of medical virology reports:Evaluation of seven commeaser-gene-treated rats in one experimental model of the GSS with different genotype and hormone levels during the treatment period, in a dose-dependent manner. E. G\n",
            "__________________________________________________ Generating new training data example 14/25 for prompt : <Journal of medical virology reports:Clinical characteristics > __________________________________________________\n",
            "Generated : \n",
            "  Journal of medical virology reports:Clinical characteristics and outcomes:Cases of noncompliance with vaccine\n",
            "\n",
            "Sensational therapy for influenza vaccination. In: National Institutes of Health Guidelines and Practice and Clinical Guidelines for International Vaccine Program, pp. 9\n",
            "__________________________________________________ Generating new training data example 15/25 for prompt : <Journal of medical virology reports:Stool samples versus naso> __________________________________________________\n",
            "Generated : \n",
            "  Journal of medical virology reports:Stool samples versus nasoembolism are more sensitive than samples taken from patients that have a history of fever and other conditions that involve gastrointestinal manifestations. The results are not immediately comparable. Moreover, nasal swabs are\n",
            "__________________________________________________ Generating new training data example 16/25 for prompt : <Journal of medical virology reports:COVID-19 related fatigue:> __________________________________________________\n",
            "Generated : \n",
            "  Journal of medical virology reports:COVID-19 related fatigue:\n",
            "\n",
            "\"On average, patients respond to three of the five stress disorders, like arthritis, back pain, fatigue, stress syndrome, headaches, memory impairment, hyperactivity, and depression\n",
            "__________________________________________________ Generating new training data example 17/25 for prompt : <Journal of medical virology reports:Clinical validation of a > __________________________________________________\n",
            "Generated : \n",
            "  Journal of medical virology reports:Clinical validation of a novel pharmacologic response to oestrogen is needed for improved prognosis, including the realization that a low-dose pharmacologic action of oestrogens is not sufficient to achieve an optimal endocrine\n",
            "__________________________________________________ Generating new training data example 18/25 for prompt : <Journal of medical virology reports:Suppression of SARS-CoV-2> __________________________________________________\n",
            "Generated : \n",
            "  Journal of medical virology reports:Suppression of SARS-CoV-2-induced immunocytosis is a major cause of morbidity worldwide and is strongly linked to the rapid decline in HIV incidence, worldwide respiratory morbidity and mortality worldwide.\n",
            "__________________________________________________ Generating new training data example 19/25 for prompt : <Journal of medical virology reports:Prognostic roles of KL-6 > __________________________________________________\n",
            "Generated : \n",
            "  Journal of medical virology reports:Prognostic roles of KL-6 in pulmonary embolism: the role of an alternative pathogen to KL-4 in acute renal failure: an update on KL-5:LBP14-18 (2003\n",
            "__________________________________________________ Generating new training data example 20/25 for prompt : <Journal of medical virology reports:Changes in computed tomog> __________________________________________________\n",
            "Generated : \n",
            "  Journal of medical virology reports:Changes in computed tomogracheal and nephrotic tomography results in increased rates of CVD of 17.8%-9.5% in the male group compared with 12.8% for the female\n",
            "__________________________________________________ Generating new training data example 21/25 for prompt : <Journal of medical virology reports:Development and validatio> __________________________________________________\n",
            "Generated : \n",
            "  Journal of medical virology reports:Development and validatio of all patients with coronary disease, and cardiovascular disease. This is the first report to provide detailed information on the quality of life in these patients, as well as the risk of stroke at baseline.In\n",
            "__________________________________________________ Generating new training data example 22/25 for prompt : <Journal of medical virology reports:The inflammatory markers > __________________________________________________\n",
            "Generated : \n",
            "  Journal of medical virology reports:The inflammatory markers C, AR, and AD in a mouse model of chronic traumatic encephalopathy (CTE) show increased tumor necrosis factor, suggesting that inflammation could be an underlying mechanism of the observed increase in T\n",
            "__________________________________________________ Generating new training data example 23/25 for prompt : <Journal of medical virology reports:Impact on disease mortali> __________________________________________________\n",
            "Generated : \n",
            "  Journal of medical virology reports:Impact on disease mortali et al. (2009) , Journal of the Medical Research Council, BMJ, July 2009 . doi: 10.1136/bjc.09.1.2009 PubMed Abstract |\n",
            "__________________________________________________ Generating new training data example 24/25 for prompt : <Journal of medical virology reports:Rapid adaptation and cont> __________________________________________________\n",
            "Generated : \n",
            "  Journal of medical virology reports:Rapid adaptation and contiguity, where the central nervous system is the main force governing the response of the entire organism to environmental stimuli by a complex neural machinery known as \"stress\" (reviewed in Caucker et\n",
            "__________________________________________________ Generating new training data example 25/25 for prompt : <Journal of medical virology reports:A bioinformatic predictio> __________________________________________________\n",
            "Generated : \n",
            "  Journal of medical virology reports:A bioinformatic predictio : e\n",
            "\n",
            "Copyright © 2014 by the Authors. Published by Elsevier Inc., All rights reserved. Visit our policies for more information.\n",
            "__________________________________________________ Generating for label=medRxiv : the preprint server for health sciences No. 3/5 __________________________________________________\n",
            "__________________________________________________ Generating new training data example 1/25 for prompt : <medRxiv : the preprint server for health sciences reports:Optimal test-assisted qua> __________________________________________________\n",
            "Generated : \n",
            "  medRxiv : the preprint server for health sciences reports:Optimal test-assisted quaSPACE : your source code for spachamax-spacer : Spacham is not a part of the spachapamax community\n",
            "__________________________________________________ Generating new training data example 2/25 for prompt : <medRxiv : the preprint server for health sciences reports:SARS-CoV-2 recruits a hae> __________________________________________________\n",
            "Generated : \n",
            "  medRxiv : the preprint server for health sciences reports:SARS-CoV-2 recruits a haeSAR-2HUBD : the second haeSubC-2 (also C-3 and B) team (\n",
            "__________________________________________________ Generating new training data example 3/25 for prompt : <medRxiv : the preprint server for health sciences reports:High-Quality Masks Reduce> __________________________________________________\n",
            "Generated : \n",
            "  medRxiv : the preprint server for health sciences reports:High-Quality Masks Reduce the performance of your Masks by 20% while you deal damage, while the damage does not slow your healing. The damage you deal does not affect damage\n",
            "__________________________________________________ Generating new training data example 4/25 for prompt : <medRxiv : the preprint server for health sciences reports:Genomic epidemiology of S> __________________________________________________\n",
            "Generated : \n",
            "  medRxiv : the preprint server for health sciences reports:Genomic epidemiology of SARS. The genetics of S. pneumoniae and its variants, a single-celled microbial community. (A.C. van Dam, T.\n",
            "__________________________________________________ Generating new training data example 5/25 for prompt : <medRxiv : the preprint server for health sciences reports:Evolving Phenotypes of no> __________________________________________________\n",
            "Generated : \n",
            "  medRxiv : the preprint server for health sciences reports:Evolving Phenotypes of no-tolerance individuals for schizophrenia\n",
            "\n",
            "\"I've had such issues at work here, that I'm very aware of why they need to be kept\n",
            "__________________________________________________ Generating new training data example 6/25 for prompt : <medRxiv : the preprint server for health sciences reports:Mapping of SARS-CoV-2 Bra> __________________________________________________\n",
            "Generated : \n",
            "  medRxiv : the preprint server for health sciences reports:Mapping of SARS-CoV-2 Braq.zip file from the archive to a file in C:\\Program Files [0.5.0]- This file can be\n",
            "__________________________________________________ Generating new training data example 7/25 for prompt : <medRxiv : the preprint server for health sciences reports:Infection and mRNA-1273 v> __________________________________________________\n",
            "Generated : \n",
            "  medRxiv : the preprint server for health sciences reports:Infection and mRNA-1273 v1.0\n",
            "\n",
            ": the preprints server for intelligence.mnt\n",
            "\n",
            "v1.1 n: the program to compute the\n",
            "__________________________________________________ Generating new training data example 8/25 for prompt : <medRxiv : the preprint server for health sciences reports:PCR assay to enhance glob> __________________________________________________\n",
            "Generated : \n",
            "  medRxiv : the preprint server for health sciences reports:PCR assay to enhance globular cell transplantation in a primary care model of diabetes : the primary care approach to transplantation , PLOS ONE , 8 , (e0111008\n",
            "__________________________________________________ Generating new training data example 9/25 for prompt : <medRxiv : the preprint server for health sciences reports:LncRNAs NEAT1 and MALAT1 > __________________________________________________\n",
            "Generated : \n",
            "  medRxiv : the preprint server for health sciences reports:LncRNAs NEAT1 and MALAT1, NEAT2 and MALT1 ;\n",
            "\n",
            ": , ; LNCRNAs NCLKJR3 ;\n",
            "__________________________________________________ Generating new training data example 10/25 for prompt : <medRxiv : the preprint server for health sciences reports:De novo Powered Air-Purif> __________________________________________________\n",
            "Generated : \n",
            "  medRxiv : the preprint server for health sciences reports:De novo Powered Air-Purif-R: https://bitcointalk.org/index.php?topic=2136.0 Disclaimer: The Disclaimer is\n",
            "__________________________________________________ Generating new training data example 11/25 for prompt : <medRxiv : the preprint server for health sciences reports:Geographically-targeted C> __________________________________________________\n",
            "Generated : \n",
            "  medRxiv : the preprint server for health sciences reports:Geographically-targeted C.C.V.V.:A database of data generated by a biogenetic engineer in the field, with information for geographic location and information about current\n",
            "__________________________________________________ Generating new training data example 12/25 for prompt : <medRxiv : the preprint server for health sciences reports:The National COVID Cohort> __________________________________________________\n",
            "Generated : \n",
            "  medRxiv : the preprint server for health sciences reports:The National COVID Cohort Survey was undertaken on 5 August 2010. It was used to establish the knowledge base of research related to the human genome and on the relationship between DNA and human\n",
            "__________________________________________________ Generating new training data example 13/25 for prompt : <medRxiv : the preprint server for health sciences reports:Hyperglycemia in Acute CO> __________________________________________________\n",
            "Generated : \n",
            "  medRxiv : the preprint server for health sciences reports:Hyperglycemia in Acute COX-1 and Acute Heart Failure is the most common chronic inflammatory state in patients with Chronic Heart Failure, which results in a significant and ongoing\n",
            "__________________________________________________ Generating new training data example 14/25 for prompt : <medRxiv : the preprint server for health sciences reports:RISK FACTORS FOR INFECTIO> __________________________________________________\n",
            "Generated : \n",
            "  medRxiv : the preprint server for health sciences reports:RISK FACTORS FOR INFECTIO-A RULER ON CREATING DATA IN HISTORY & DATA FILENAMES! (Updated 2013)RISA CHANCELL\n",
            "__________________________________________________ Generating new training data example 15/25 for prompt : <medRxiv : the preprint server for health sciences reports:SARS-CoV-2 specific T cel> __________________________________________________\n",
            "Generated : \n",
            "  medRxiv : the preprint server for health sciences reports:SARS-CoV-2 specific T cel (M)L x YYR (B)CX v BXV y CXZ (B)*H3 (\n",
            "__________________________________________________ Generating new training data example 16/25 for prompt : <medRxiv : the preprint server for health sciences reports:The interplay of policy, > __________________________________________________\n",
            "Generated : \n",
            "  medRxiv : the preprint server for health sciences reports:The interplay of policy, policy information, and policy feedback are being tested through \"Banking in Medicine\", a course sponsored by the National Center for Health Statistics. The purpose of the\n",
            "__________________________________________________ Generating new training data example 17/25 for prompt : <medRxiv : the preprint server for health sciences reports:Acute Brain Ischemia, Inf> __________________________________________________\n",
            "Generated : \n",
            "  medRxiv : the preprint server for health sciences reports:Acute Brain Ischemia, Inflammation, and Neural Fiber Loss: a study of the first 20 years after the Great Crash.JAMA. 2014 Nov 15;301(\n",
            "__________________________________________________ Generating new training data example 18/25 for prompt : <medRxiv : the preprint server for health sciences reports:Integrative approach iden> __________________________________________________\n",
            "Generated : \n",
            "  medRxiv : the preprint server for health sciences reports:Integrative approach iden : an IETF's idea of an online, peer-to-peer medical journal, published in September 2004, published by the World Health Organization and based on\n",
            "__________________________________________________ Generating new training data example 19/25 for prompt : <medRxiv : the preprint server for health sciences reports:Alpha-1 blockers and susc> __________________________________________________\n",
            "Generated : \n",
            "  medRxiv : the preprint server for health sciences reports:Alpha-1 blockers and suscribers:\n",
            "\n",
            "The paper's introduction\n",
            "\n",
            "In 2009, my student Professor Stephen K. Moore introduced his student Ph.D. research into\n",
            "__________________________________________________ Generating new training data example 20/25 for prompt : <medRxiv : the preprint server for health sciences reports:Histopathological assessm> __________________________________________________\n",
            "Generated : \n",
            "  medRxiv : the preprint server for health sciences reports:Histopathological assessm - provides a comprehensive survey of the anatomy of the face with the help of the Facial Reconstruction Project - a high-performance MRI analysis of the entire face which\n",
            "__________________________________________________ Generating new training data example 21/25 for prompt : <medRxiv : the preprint server for health sciences reports:Inactivation of SARS-CoV-> __________________________________________________\n",
            "Generated : \n",
            "  medRxiv : the preprint server for health sciences reports:Inactivation of SARS-CoV-7C, an experimental variant of the HVTARv1, a variant of SCCAV5 that was developed by the Ukrainian Federal\n",
            "__________________________________________________ Generating new training data example 22/25 for prompt : <medRxiv : the preprint server for health sciences reports:COVID-19 Mortality in Cal> __________________________________________________\n",
            "Generated : \n",
            "  medRxiv : the preprint server for health sciences reports:COVID-19 Mortality in Calvary County; 2011-08-02:CAMR-14; ABA; M.S.C._Mortality;\n",
            "__________________________________________________ Generating new training data example 23/25 for prompt : <medRxiv : the preprint server for health sciences reports:High levels of common col> __________________________________________________\n",
            "Generated : \n",
            "  medRxiv : the preprint server for health sciences reports:High levels of common coliform disease common in the United States. Epidemiology of coliform diseases, 2005 5 : 659-668 . doi: 10.1016/j.ep\n",
            "__________________________________________________ Generating new training data example 24/25 for prompt : <medRxiv : the preprint server for health sciences reports:Drug repositioning candid> __________________________________________________\n",
            "Generated : \n",
            "  medRxiv : the preprint server for health sciences reports:Drug repositioning candidiasis (SciRx1 or Rxrx2) for patients with SciRxes1 , the European Medicines Agency (EMA) , 8\n",
            "__________________________________________________ Generating new training data example 25/25 for prompt : <medRxiv : the preprint server for health sciences reports:SARS-CoV-2 Seroprevalence> __________________________________________________\n",
            "Generated : \n",
            "  medRxiv : the preprint server for health sciences reports:SARS-CoV-2 Seroprevalence (95):http://sars-cpv.netv4.net/SARS/1.2.4\n",
            "__________________________________________________ Generating for label=Frontiers in psychology No. 4/5 __________________________________________________\n",
            "__________________________________________________ Generating new training data example 1/25 for prompt : <Frontiers in psychology reports:Hindu Response to Dying a> __________________________________________________\n",
            "Generated : \n",
            "  Frontiers in psychology reports:Hindu Response to Dying a Hero (Washington DC: National Academy of Sciences, 1988):The effects of religion on the prevention of dying in Buddhist traditions, such as Sivananda Buddhism, were considered by Sivanand of\n",
            "__________________________________________________ Generating new training data example 2/25 for prompt : <Frontiers in psychology reports:Factors Influencing Publi> __________________________________________________\n",
            "Generated : \n",
            "  Frontiers in psychology reports:Factors Influencing PubliRelationship\n",
            "\n",
            "2.3. Conflict: Relationships and Conflict Resolution:\n",
            "\n",
            "5.4. Conflict Management:\n",
            ".@Conflict Management\n",
            "\n",
            "4.10. Communication:\n",
            "__________________________________________________ Generating new training data example 3/25 for prompt : <Frontiers in psychology reports:Pandemic Leadership: Sex > __________________________________________________\n",
            "Generated : \n",
            "  Frontiers in psychology reports:Pandemic Leadership: Sex Work , 9 , 9 and 15\n",
            "__________________________________________________ Generating new training data example 4/25 for prompt : <Frontiers in psychology reports:Maladaptive Daydreaming i> __________________________________________________\n",
            "Generated : \n",
            "  Frontiers in psychology reports:Maladaptive Daydreaming i (2001) \"A new study demonstrates that a negative mood is more readily reported in dream-oriented individuals, but only when they are also perceived to have a negative state of mind\". In\n",
            "__________________________________________________ Generating new training data example 5/25 for prompt : <Frontiers in psychology reports:Our Virtual Tribe: Sustai> __________________________________________________\n",
            "Generated : \n",
            "  Frontiers in psychology reports:Our Virtual Tribe: Sustai, an online study of human social life, found that social interaction was strongly associated with increased aggression and aggression-like behaviour. The study found that participants who received an extra food and a drink\n",
            "__________________________________________________ Generating new training data example 6/25 for prompt : <Frontiers in psychology reports:The Role of Musical Aesth> __________________________________________________\n",
            "Generated : \n",
            "  Frontiers in psychology reports:The Role of Musical Aesthorts and The Meaning of Sexual Attraction in the Development of Adulthood:A study of men who participated in the L.A. International Music Association (LIA/MIAP)\n",
            "__________________________________________________ Generating new training data example 7/25 for prompt : <Frontiers in psychology reports:Analysis and Psychoeducat> __________________________________________________\n",
            "Generated : \n",
            "  Frontiers in psychology reports:Analysis and Psychoeducat. Journal of Experimental Social Psychology, 24(1): 57–80.\n",
            "\n",
            "Alkali, E.B., Pölller, E., & Reisman, A. (1996\n",
            "__________________________________________________ Generating new training data example 8/25 for prompt : <Frontiers in psychology reports:Further to the Left: Stre> __________________________________________________\n",
            "Generated : \n",
            "  Frontiers in psychology reports:Further to the Left: Strengths and weaknesses.\n",
            "\n",
            "In the recent past, I have attempted to reconcile these two lines within psychology: in my recent publications on the psychology of prejudice and prejudice , I made it very\n",
            "__________________________________________________ Generating new training data example 9/25 for prompt : <Frontiers in psychology reports:Coping With COVID-19: The> __________________________________________________\n",
            "Generated : \n",
            "  Frontiers in psychology reports:Coping With COVID-19: The Influence of Interactions on Life Satisfaction. (2004, 4th ed.) Available at http://www.bibliophile.org/research/civ.htm.\n",
            "\n",
            "__________________________________________________ Generating new training data example 10/25 for prompt : <Frontiers in psychology reports:Icing on the Cake: \"Ampli> __________________________________________________\n",
            "Generated : \n",
            "  Frontiers in psychology reports:Icing on the Cake: \"Ampli. , \"Treat your face as a receptacle of social contact . . .\": \"It won't make you a good listener.\" (Cheshire 2004), on\n",
            "__________________________________________________ Generating new training data example 11/25 for prompt : <Frontiers in psychology reports:Message Framing Effects o> __________________________________________________\n",
            "Generated : \n",
            "  Frontiers in psychology reports:Message Framing Effects o 1:Biological or social influences in an intergroup social context (1) An open, self-selected, and social network in animals: Evidence: The behavioral effects of groups of animals are not\n",
            "__________________________________________________ Generating new training data example 12/25 for prompt : <Frontiers in psychology reports:Protective and Risk Facto> __________________________________________________\n",
            "Generated : \n",
            "  Frontiers in psychology reports:Protective and Risk Facto-Consultancy on Psychological Conclusions . Annals of Emergency Medicine, 2010 May. 27(9): 1414-1419. doi:10.1002/ajem.11\n",
            "__________________________________________________ Generating new training data example 13/25 for prompt : <Frontiers in psychology reports:Growth Mindset and Colleg> __________________________________________________\n",
            "Generated : \n",
            "  Frontiers in psychology reports:Growth Mindset and Collegiate Psychology in the Social Sciences (2012):1045 to 7.\n",
            "\n",
            "Dorner, Peter S. and Ruhlheim, Michael N. (2010): 'Learning Style'\n",
            "__________________________________________________ Generating new training data example 14/25 for prompt : <Frontiers in psychology reports:Workload, Techno Overload> __________________________________________________\n",
            "Generated : \n",
            "  Frontiers in psychology reports:Workload, Techno Overload and Discomfort Related to Occupational and Environment Effects on Work Hours and Productivity, Volume 4, No. 5, April 1996\n",
            "\n",
            "[16] See also: \"Work Hours and\n",
            "__________________________________________________ Generating new training data example 15/25 for prompt : <Frontiers in psychology reports:Predicting the Severity o> __________________________________________________\n",
            "Generated : \n",
            "  Frontiers in psychology reports:Predicting the Severity o.d. of Psychotic Disorder and Related Disorders: \"All social conflicts have become increasingly confrontational between dissident groups, individuals exhibiting psychotic disorders, or the general public. The current crisis\n",
            "__________________________________________________ Generating new training data example 16/25 for prompt : <Frontiers in psychology reports:Stress, Sleep and Psychol> __________________________________________________\n",
            "Generated : \n",
            "  Frontiers in psychology reports:Stress, Sleep and Psycholimetics in the United States\n",
            "\n",
            "\"Some men go to sleep in the same manner as their mates do, or they sleep while they're out drinking or when they eat or when sleeping\n",
            "__________________________________________________ Generating new training data example 17/25 for prompt : <Frontiers in psychology reports:Higher Physical Activity > __________________________________________________\n",
            "Generated : \n",
            "  Frontiers in psychology reports:Higher Physical Activity and Obesity. 2015 Mar 25;21(9):2927-31.\n",
            "\n",
            "12. Gao. D.H. et al. Increased energy expenditure is associated with body weight loss and obesity in\n",
            "__________________________________________________ Generating new training data example 18/25 for prompt : <Frontiers in psychology reports:Negative Affectivity, Aut> __________________________________________________\n",
            "Generated : \n",
            "  Frontiers in psychology reports:Negative Affectivity, Autistic, and Bipolar Disorder: Implications for Therapy. J Hum Reprod. 19(14):1958-1967 View in Article Scopus (1059)\n",
            "\n",
            "PubMed\n",
            "\n",
            "__________________________________________________ Generating new training data example 19/25 for prompt : <Frontiers in psychology reports:How Does a Sport Psycholo> __________________________________________________\n",
            "Generated : \n",
            "  Frontiers in psychology reports:How Does a Sport Psycholo Program Work?A number of factors and influences influence the use of sport-related programs in our life. What are your thoughts on this? Discuss your knowledge, experiences or experience as well as suggestions\n",
            "__________________________________________________ Generating new training data example 20/25 for prompt : <Frontiers in psychology reports:Digital Approaches to Mus> __________________________________________________\n",
            "Generated : \n",
            "  Frontiers in psychology reports:Digital Approaches to Muscular Performance (BRIANSON, JOSE) . Annals of Behavioral and Brain Sciences (DOI: 10.1037/1949-1810) , .\n",
            "\n",
            "Tob\n",
            "__________________________________________________ Generating new training data example 21/25 for prompt : <Frontiers in psychology reports:Depression and Anxiety Am> __________________________________________________\n",
            "Generated : \n",
            "  Frontiers in psychology reports:Depression and Anxiety Am J Psychiatry. 2007 Sep;57(5):1072-79. doi: 10.1176/j.01.13.55\n",
            "\n",
            "Stiglitz, R., and Taus\n",
            "__________________________________________________ Generating new training data example 22/25 for prompt : <Frontiers in psychology reports:Psychosocial Framework of> __________________________________________________\n",
            "Generated : \n",
            "  Frontiers in psychology reports:Psychosocial Framework of Cognitive Functions (2008), 7-11.\n",
            "\n",
            "Palloy-Becker, N. (1996). Intelligence and the Psychology of the Mind. Oxford; New York: Oxford University Press.\n",
            "__________________________________________________ Generating new training data example 23/25 for prompt : <Frontiers in psychology reports:Suffering and Salutogenes> __________________________________________________\n",
            "Generated : \n",
            "  Frontiers in psychology reports:Suffering and Salutogenes, we are aware that in the future many of our current behavioral and social skills might well be developed or developed more widely. And yet we are yet to learn the necessary social skills and\n",
            "__________________________________________________ Generating new training data example 24/25 for prompt : <Frontiers in psychology reports:Educational and Social Ex> __________________________________________________\n",
            "Generated : \n",
            "  Frontiers in psychology reports:Educational and Social Exclusion: From Unintentional Behavior to the Criminalization of Discrimination in America (Vol. 29, No. 7, 1991).\n",
            "\n",
            "8. Erikson, J. (2004a).\n",
            "__________________________________________________ Generating new training data example 25/25 for prompt : <Frontiers in psychology reports:Comparison Between Conven> __________________________________________________\n",
            "Generated : \n",
            "  Frontiers in psychology reports:Comparison Between Convenience and Conveniences (p. 590), it appears that satisfaction with life increases with stress, even when stressors are minimized (Chung et al., 2012a, b, c).\n",
            "__________________________________________________ Generating for label=International journal of environmental research and public health No. 5/5 __________________________________________________\n",
            "__________________________________________________ Generating new training data example 1/25 for prompt : <International journal of environmental research and public health reports:Fluctuations in National > __________________________________________________\n",
            "Generated : \n",
            "  International journal of environmental research and public health reports:Fluctuations in National Household Health Insurance Coverage\n",
            "\n",
            "P. S. Siegel, C. L. Williams, F. K. Hegerstetter and O. R. Lipsky\n",
            "__________________________________________________ Generating new training data example 2/25 for prompt : <International journal of environmental research and public health reports:Peritraumatic Distress du> __________________________________________________\n",
            "Generated : \n",
            "  International journal of environmental research and public health reports:Peritraumatic Distress du Nord-Lebanon, Paris, 2001. (p65). In \"Is there a risk of this kind of earthquake in Africa?\", it is reported that the damage\n",
            "__________________________________________________ Generating new training data example 3/25 for prompt : <International journal of environmental research and public health reports:COVID-19 Vaccine Acceptan> __________________________________________________\n",
            "Generated : \n",
            "  International journal of environmental research and public health reports:COVID-19 Vaccine Acceptanisms and Epidemiological References , Journal of Environmental Law and Environmental Health. 2009 , vol. 38 (pg. 23 - 26 ) , vol.(pg. 1.\n",
            "__________________________________________________ Generating new training data example 4/25 for prompt : <International journal of environmental research and public health reports:COVID-19 Severity and Neo> __________________________________________________\n",
            "Generated : \n",
            "  International journal of environmental research and public health reports:COVID-19 Severity and Neo-Aged Pairs\n",
            "\n",
            "Friedman et al., J Public Health (1919).\n",
            "\n",
            "\"New World Order: A Biobot Analysis .\"\n",
            "__________________________________________________ Generating new training data example 5/25 for prompt : <International journal of environmental research and public health reports:Eco-Environmental Aspects> __________________________________________________\n",
            "Generated : \n",
            "  International journal of environmental research and public health reports:Eco-Environmental Aspects of Biological Diversity.\n",
            "__________________________________________________ Generating new training data example 6/25 for prompt : <International journal of environmental research and public health reports:The Impact of the COVID-1> __________________________________________________\n",
            "Generated : \n",
            "  International journal of environmental research and public health reports:The Impact of the COVID-1 on Global Bioengineering. Science 319, 895-803 (2012).\n",
            "\n",
            "Troubleman, J. & B. van den Bruhl, \"\n",
            "__________________________________________________ Generating new training data example 7/25 for prompt : <International journal of environmental research and public health reports:SARS-CoV-2 Seroprevalence> __________________________________________________\n",
            "Generated : \n",
            "  International journal of environmental research and public health reports:SARS-CoV-2 Seroprevalence of Influenza Virus-associated Cholera Virus (CvC) in Childhood and Future Infectious Diseases: A Systematic Review\" [\n",
            "__________________________________________________ Generating new training data example 8/25 for prompt : <International journal of environmental research and public health reports:Parent-Infant Skin-to-Ski> __________________________________________________\n",
            "Generated : \n",
            "  International journal of environmental research and public health reports:Parent-Infant Skin-to-Ski Health Care System (2007, 2009). \"Risk factors for developing skin contact dermatitis: a comprehensive clinical study on a broad population of 1.4\n",
            "__________________________________________________ Generating new training data example 9/25 for prompt : <International journal of environmental research and public health reports:Overwhelmed by Technostre> __________________________________________________\n",
            "Generated : \n",
            "  International journal of environmental research and public health reports:Overwhelmed by Technostrengtha's financial crisis, G.S.D. said he had found a way to get his son's passport within 30 days. But the government had lost track\n",
            "__________________________________________________ Generating new training data example 10/25 for prompt : <International journal of environmental research and public health reports:Reaching out for Help: Ca> __________________________________________________\n",
            "Generated : \n",
            "  International journal of environmental research and public health reports:Reaching out for Help: Cautionary Tales in the Science of Obesity: http://www.sciencedirect.com/science/article/pii/S00358534006737\n",
            "__________________________________________________ Generating new training data example 11/25 for prompt : <International journal of environmental research and public health reports:Knowledge, Attitude, and > __________________________________________________\n",
            "Generated : \n",
            "  International journal of environmental research and public health reports:Knowledge, Attitude, and Health. Oxford: Pp. 967–93.\n",
            "\n",
            "Wang. 2005. The Role of a Population‐Centered Epidemiological Model in Sudden Inf\n",
            "__________________________________________________ Generating new training data example 12/25 for prompt : <International journal of environmental research and public health reports:Economic Role of Populati> __________________________________________________\n",
            "Generated : \n",
            "  International journal of environmental research and public health reports:Economic Role of Populati to the Climate Process in Emerging and Plumed Lands, vol. 44, no. 3 pp. (Spring 2009): 13–27.\n",
            "\n",
            "Hansen, J.,\n",
            "__________________________________________________ Generating new training data example 13/25 for prompt : <International journal of environmental research and public health reports:The Effect of Lockdown Pe> __________________________________________________\n",
            "Generated : \n",
            "  International journal of environmental research and public health reports:The Effect of Lockdown Peeces On the Prevention of Chronic Disease in Adults.JHJ.Sci. 2003:22.doi.org/10.1007/s12173-013\n",
            "__________________________________________________ Generating new training data example 14/25 for prompt : <International journal of environmental research and public health reports:Cardiorespiratory Fitness> __________________________________________________\n",
            "Generated : \n",
            "  International journal of environmental research and public health reports:Cardiorespiratory Fitness.2015 8:2, p. 1654. doi:10.1016/j.cbvint.2014.12.089 PubMed Abstract | CrossRef Full Text\n",
            "__________________________________________________ Generating new training data example 15/25 for prompt : <International journal of environmental research and public health reports:Multicomponent Home-Based> __________________________________________________\n",
            "Generated : \n",
            "  International journal of environmental research and public health reports:Multicomponent Home-Based Therapeutic Health Systems for Humans, 2007. Available online at: http://articles.mhsv.org/epidemiology/homes/happen\n",
            "__________________________________________________ Generating new training data example 16/25 for prompt : <International journal of environmental research and public health reports:Spatial and Social Behavi> __________________________________________________\n",
            "Generated : \n",
            "  International journal of environmental research and public health reports:Spatial and Social Behaviours.\n",
            "\n",
            "Sheffrey, C. W. (1898). Sexualities in human society. Cambridge, UK: Cambridge University Press.\n",
            ".\n",
            ",\n",
            "\n",
            "__________________________________________________ Generating new training data example 17/25 for prompt : <International journal of environmental research and public health reports:Co-Infections in Critical> __________________________________________________\n",
            "Generated : \n",
            "  International journal of environmental research and public health reports:Co-Infections in Critical Care: an International journal, was published in 2002, where it was considered safe to live in a home with at least one in the three sites infected. The American Society\n",
            "__________________________________________________ Generating new training data example 18/25 for prompt : <International journal of environmental research and public health reports:Job Demands, Resources an> __________________________________________________\n",
            "Generated : \n",
            "  International journal of environmental research and public health reports:Job Demands, Resources an Evaluation, 2014. ( http://dx.doi.org/10.2306/jmpg-2016-1045 .)\n",
            "\n",
            "[31]\n",
            "\n",
            "*\n",
            "__________________________________________________ Generating new training data example 19/25 for prompt : <International journal of environmental research and public health reports:Impact of the COVID-19 Pa> __________________________________________________\n",
            "Generated : \n",
            "  International journal of environmental research and public health reports:Impact of the COVID-19 Paquette et al. (2016).\n",
            "\n",
            "\"Covenant-based and individual-based vaccines, including a single-dose, nonmedical version (\n",
            "__________________________________________________ Generating new training data example 20/25 for prompt : <International journal of environmental research and public health reports:A Non-Linear Biostatistic> __________________________________________________\n",
            "Generated : \n",
            "  International journal of environmental research and public health reports:A Non-Linear Biostatistician in Global Environment, Vol. 38, No. 2, pp. 19 - 31 , 2004 , vol. 24 (pg. 1387 - 4 )\n",
            "__________________________________________________ Generating new training data example 21/25 for prompt : <International journal of environmental research and public health reports:Social Isolation and Lone> __________________________________________________\n",
            "Generated : \n",
            "  International journal of environmental research and public health reports:Social Isolation and Lone Star Rising - http://dx.doi.org/10.1189/S0315-63092004\n",
            "__________________________________________________ Generating new training data example 22/25 for prompt : <International journal of environmental research and public health reports:Changes in Alcohol Consum> __________________________________________________\n",
            "Generated : \n",
            "  International journal of environmental research and public health reports:Changes in Alcohol Consumable Footprint of Menstrual Dysfunction. Journal of the American Dietetic Association, January 1994.\n",
            "\n",
            "Wass: \"The Effect of Alcohol on the Maternal Fat\n",
            "__________________________________________________ Generating new training data example 23/25 for prompt : <International journal of environmental research and public health reports:Prediction Models for Pub> __________________________________________________\n",
            "Generated : \n",
            "  International journal of environmental research and public health reports:Prediction Models for Publish, 2005;\n",
            "\n",
            "P.G. Kellett (N.T., J.L.) & J.R. Voss (R.) The role of carbon\n",
            "__________________________________________________ Generating new training data example 24/25 for prompt : <International journal of environmental research and public health reports:Factors Associated with t> __________________________________________________\n",
            "Generated : \n",
            "  International journal of environmental research and public health reports:Factors Associated with tp: a review of previous literature and relevant relevant public health and environmental research results for the past 15 years and for the potential of future studies with respect to the impact of exposure to\n",
            "__________________________________________________ Generating new training data example 25/25 for prompt : <International journal of environmental research and public health reports:COVID-19 Perceived Impact> __________________________________________________\n",
            "Generated : \n",
            "  International journal of environmental research and public health reports:COVID-19 Perceived Impact of Environmental Effects in the Development of Pesticides on Biodiversity and Natural Habitat, Proceedings of the American Academy of Sciences 105, 727-756 (\n",
            "Augmented Training Dataset Size = 250\n",
            "Training Augmented Classifierr\n",
            "sent_small_bert_L2_128 download started this may take some time.\n",
            "Approximate size to download 16.1 MB\n",
            "[OK!]\n",
            "Training Settings : Epochs=20, learn_rate=0.0005, batch_size=32, dropout=0.5\n",
            "sentence_detector_dl download started this may take some time.\n",
            "Approximate size to download 354.6 KB\n",
            "[OK!]\n",
            "__________________________________________________ Metrics on vanilla Train datataset with AUGMENTED Model __________________________________________________\n",
            "                                                                   precision    recall  f1-score   support\n",
            "\n",
            "                                      BMJ (Clinical research ed.)       0.65      0.94      0.77        36\n",
            "                                          Frontiers in psychology       0.56      0.85      0.68        27\n",
            "International journal of environmental research and public health       1.00      0.11      0.19        28\n",
            "                                      Journal of medical virology       0.47      0.65      0.54        31\n",
            "                medRxiv : the preprint server for health sciences       0.67      0.23      0.34        26\n",
            "\n",
            "                                                         accuracy                           0.58       148\n",
            "                                                        macro avg       0.67      0.56      0.51       148\n",
            "                                                     weighted avg       0.67      0.58      0.52       148\n",
            "\n",
            "____________________________________________________________________________________________________\n",
            "__________________________________________________ Metrics on vanilla Train datataset with VANILLA Model __________________________________________________\n",
            "                                                                   precision    recall  f1-score   support\n",
            "\n",
            "                                      BMJ (Clinical research ed.)       0.49      0.76      0.60       225\n",
            "                                          Frontiers in psychology       0.44      0.82      0.58       225\n",
            "International journal of environmental research and public health       0.00      0.00      0.00       225\n",
            "                                      Journal of medical virology       0.50      0.74      0.59       225\n",
            "                medRxiv : the preprint server for health sciences       0.59      0.06      0.11       225\n",
            "\n",
            "                                                         accuracy                           0.48      1125\n",
            "                                                        macro avg       0.40      0.48      0.37      1125\n",
            "                                                     weighted avg       0.40      0.48      0.37      1125\n",
            "\n",
            "____________________________________________________________________________________________________\n",
            "__________________________________________________ Metrics on Test datataset with AUGMENTED Model __________________________________________________\n",
            "                                                                   precision    recall  f1-score   support\n",
            "\n",
            "                                      BMJ (Clinical research ed.)       0.53      0.91      0.67        91\n",
            "                                          Frontiers in psychology       0.59      0.71      0.64        92\n",
            "International journal of environmental research and public health       0.70      0.08      0.14        87\n",
            "                                      Journal of medical virology       0.53      0.59      0.56        78\n",
            "                medRxiv : the preprint server for health sciences       0.51      0.38      0.43        66\n",
            "\n",
            "                                                         accuracy                           0.55       414\n",
            "                                                        macro avg       0.57      0.53      0.49       414\n",
            "                                                     weighted avg       0.57      0.55      0.49       414\n",
            "\n",
            "____________________________________________________________________________________________________\n",
            "__________________________________________________ Metrics on Test datataset with VANILLA Model __________________________________________________\n",
            "                                                                   precision    recall  f1-score   support\n",
            "\n",
            "                                      BMJ (Clinical research ed.)       0.70      0.97      0.81        36\n",
            "                                          Frontiers in psychology       0.53      0.93      0.68        27\n",
            "International journal of environmental research and public health       0.00      0.00      0.00        28\n",
            "                                      Journal of medical virology       0.54      0.81      0.65        31\n",
            "                medRxiv : the preprint server for health sciences       1.00      0.19      0.32        26\n",
            "\n",
            "                                                         accuracy                           0.61       148\n",
            "                                                        macro avg       0.56      0.58      0.49       148\n",
            "                                                     weighted avg       0.56      0.61      0.51       148\n",
            "\n",
            "____________________________________________________________________________________________________\n",
            "__________________________________________________ Metrics on Augmented Train datataset with AUGMENTED Model __________________________________________________\n",
            "                                                                   precision    recall  f1-score   support\n",
            "\n",
            "                                      BMJ (Clinical research ed.)       0.56      0.72      0.63       225\n",
            "                                          Frontiers in psychology       0.49      0.85      0.62       225\n",
            "International journal of environmental research and public health       0.48      0.06      0.11       225\n",
            "                                      Journal of medical virology       0.46      0.66      0.54       225\n",
            "                medRxiv : the preprint server for health sciences       0.42      0.18      0.25       225\n",
            "\n",
            "                                                         accuracy                           0.49      1125\n",
            "                                                        macro avg       0.48      0.49      0.43      1125\n",
            "                                                     weighted avg       0.48      0.49      0.43      1125\n",
            "\n",
            "______________________________________________________________________________________________________________________________________________________\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Data Augmentation on a Legal Dataset\n",
        "\n",
        "Using data from  https://www.kaggle.com/datasets/mohammedalrashidan/contracts-clauses-datasets?resource=download\n",
        "\n",
        "The data source is from was scraped from contracts website where  over 21k legal clauses have been collected from 16 type of clauses that are related to finance\n",
        "\n",
        "![img](https://media.istockphoto.com/photos/statue-of-lady-justice-and-supreme-court-building-picture-id1140705087?k=20&m=1140705087&s=170667a&w=0&h=qJ7xw7uTfoXdozrzpfeMb4C1UbU7X1cncJV8dCZszWc=)"
      ],
      "metadata": {
        "id": "2aZHl3U61y5w",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [
        "! wget http://ckl-it.de/wp-content/uploads/2022/09/legal_docs.csv\n",
        "import pandas as pd\n",
        "train_df = pd.read_csv('/content/legal_docs.csv')\n",
        "# the text data to use for classification should be in a column named 'text' \n",
        "train_df = train_df[['clause_text','clause_type']]\n",
        "train_df.columns = ['text','y']\n",
        "train_df\n",
        "\n",
        "\n",
        "# Lets only keep the first 5 labels for our experiment\n",
        "shrink_train_df = []\n",
        "labels_to_keep = ['base-salary', 'interest', 'investments', 'loans']\n",
        "for label in labels_to_keep:\n",
        "  shrink_train_df.append(train_df[train_df.y==label])\n",
        "shrink_train_df = pd.concat(shrink_train_df)\n",
        "shrink_train_df.y.value_counts().plot.barh(figsize=(20,16), title='Label Distribution')\n",
        "shrink_train_df\n"
      ],
      "metadata": {
        "id": "UJLaMWszbsji",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 1000
        },
        "outputId": "72fb41c8-c99c-4809-f777-9d8b70f75c5a",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "--2022-09-04 14:42:03--  http://ckl-it.de/wp-content/uploads/2022/09/legal_docs.csv\n",
            "Resolving ckl-it.de (ckl-it.de)... 217.160.0.108, 2001:8d8:100f:f000::209\n",
            "Connecting to ckl-it.de (ckl-it.de)|217.160.0.108|:80... connected.\n",
            "HTTP request sent, awaiting response... 200 OK\n",
            "Length: 13526151 (13M) [text/csv]\n",
            "Saving to: ‘legal_docs.csv.2’\n",
            "\n",
            "\rlegal_docs.csv.2      0%[                    ]       0  --.-KB/s               \rlegal_docs.csv.2    100%[===================>]  12.90M  64.7MB/s    in 0.2s    \n",
            "\n",
            "2022-09-04 14:42:04 (64.7 MB/s) - ‘legal_docs.csv.2’ saved [13526151/13526151]\n",
            "\n"
          ]
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "                                                   text            y\n",
              "9327   During the Employment Period, the Executive s...  base-salary\n",
              "9328   During the Employment Period, the Executive s...  base-salary\n",
              "9329   During the Term, the Executive’s initial annu...  base-salary\n",
              "9330   During the Employment Period, the Executive s...  base-salary\n",
              "9331   “Base Salary” shall have the meaning set fort...  base-salary\n",
              "...                                                 ...          ...\n",
              "5438                    As specified in the Prospectus.        loans\n",
              "5439   From: Kosmos Energy Finance International (th...        loans\n",
              "5440   2. The aggregate amount of the proposed Borro...        loans\n",
              "5441   The Borrower will not make any investment in ...        loans\n",
              "5442       (a ) Aggregate amount of new Loans to be $ ;        loans\n",
              "\n",
              "[3940 rows x 2 columns]"
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-c8a1d756-74e1-485e-9cff-999100e07b6b\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>text</th>\n",
              "      <th>y</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>9327</th>\n",
              "      <td>During the Employment Period, the Executive s...</td>\n",
              "      <td>base-salary</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>9328</th>\n",
              "      <td>During the Employment Period, the Executive s...</td>\n",
              "      <td>base-salary</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>9329</th>\n",
              "      <td>During the Term, the Executive’s initial annu...</td>\n",
              "      <td>base-salary</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>9330</th>\n",
              "      <td>During the Employment Period, the Executive s...</td>\n",
              "      <td>base-salary</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>9331</th>\n",
              "      <td>“Base Salary” shall have the meaning set fort...</td>\n",
              "      <td>base-salary</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>...</th>\n",
              "      <td>...</td>\n",
              "      <td>...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>5438</th>\n",
              "      <td>As specified in the Prospectus.</td>\n",
              "      <td>loans</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>5439</th>\n",
              "      <td>From: Kosmos Energy Finance International (th...</td>\n",
              "      <td>loans</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>5440</th>\n",
              "      <td>2. The aggregate amount of the proposed Borro...</td>\n",
              "      <td>loans</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>5441</th>\n",
              "      <td>The Borrower will not make any investment in ...</td>\n",
              "      <td>loans</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>5442</th>\n",
              "      <td>(a ) Aggregate amount of new Loans to be $ ;</td>\n",
              "      <td>loans</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "<p>3940 rows × 2 columns</p>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-c8a1d756-74e1-485e-9cff-999100e07b6b')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "        \n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "      \n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-c8a1d756-74e1-485e-9cff-999100e07b6b button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-c8a1d756-74e1-485e-9cff-999100e07b6b');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 9
        },
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "<Figure size 1440x1152 with 1 Axes>"
            ],
            "image/png": "iVBORw0KGgoAAAANSUhEUgAABK8AAAOVCAYAAACvQsOhAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAgAElEQVR4nOzdf8zud13f8debHixCsQwKWjrgDCy/lPFbsKCrQFB3RMHBOoQNtgyyZIQgDsFNM+bA1EAcIZrMLkYNVIogC6x0BDQgUOjgFEorCDq2NlAGlGrLjwKW8tkf91W4OT09P9rT3q/DeTySO9d1ru/n+/18rvuvk2c+3+89a60AAAAAQKPb7PQCAAAAAODGiFcAAAAA1BKvAAAAAKglXgEAAABQS7wCAAAAoJZ4BQAAAEAt8QoA4GaYmXfNzL++tc/dnP9jM/OJm3r+fq73P2fmWZv3z56Z9x7Baz9jZt5+pK4HABw7xCsAgCQzc+nMPGGn13G9mXnpzFw7M1/a/PzVzPz2zJx8/Zi11nvWWvc7xGu99mDj1lo/vdb6wyOw9t0zs2Zm17Zrn73WeuLNvTYAcOwRrwAAer1+rXXHJHdO8pQkP5Dkwu0B60iYLf5fCABU8p8UAIADmJm/NzPnzswVM/O3m/d/f59h95mZD8zMF2fmzTNz523nP3pm3jczV83MR2bm9MNdw1rr2rXWR5OckeSKJL+0ufbpM/PpbXO9eGYu3+zU+sTMPH5mfirJv09yxsx8eWY+shn7rpl5+cycn+SaJPfez22Ms9ntdfXMfHxmHr/twHfsVNtnd9e7N69Xbeb80X1vQ5yZ02bmg5trf3BmTtt27F0z859n5vzNd3n7zJx0uL83AOC7g3gFAHBgt0ny+0nuleSeSb6a5Lf3GfMvkvyrJCcn+UaSVyfJzJyS5K1JXpat3VP/LsmfzMxdb8pC1lrXJXlzkh/b99jM3C/J85I8crNb6yeTXLrWeluS38jWLq4T1loP3nbaP0/y3CR3THLZfqZ8VJJPJjkpyX9M8qbtYe4AfnzzeqfNnO/fZ613ztbv5dVJ7pLkt5K8dWbusm3YLyT5l0nuluR7svW7AwCOQeIVAMABrLWuXGv9yVrrmrXWl5K8PMk/2mfYa9Zaf7HW+kqSX0vyT2fmuCTPTHLeWuu8tdY311rvSLI3yT++GUv6TLZC2L6uS3J8kgfOzG3XWpeutT55kGv9wVrro2utb6y1rt3P8c8nedVm59frk3wiyZ6bsfbr7Uny12ut12zmfl2Sjyd50rYxv7/W+qu11leT/HGShxyBeQGAo5B4BQBwADNz+5n53Zm5bGa+mK1b4u60iVPX+9S295cluW22divdK8nTNrcMXjUzVyV5bLZ2aN1UpyT5m30/XGv97yQvSPLSJJ+fmXNm5u4HudanDnL88rXW2vbvy5Ic7JqH4u654U6vy7L13a732W3vr0lywhGYFwA4ColXAAAH9ktJ7pfkUWut78u3b4mbbWPuse39PZNcm+QL2YpDr1lr3Wnbzx3WWmfelIVsHqr+pCTv2d/xtdYfrbUem61otpL85vWHbuSSN/b59U6Zme3f857Z2vmVJF9Jcvttx37gMK77mc0at7tnkssPch4AcAwSrwAAvu22M3O7bT+7svU8qK9m6+Hjd87Ws5/29cyZeeDM3D7Jryd54+b5VK9N8qSZ+cmZOW5zzdP388D3A5qZXTPzgCSvy1Yk+q39jLnfzDxuZo5P8rXNmr+5Ofy5JLtvwl8UvFuS58/MbWfmaUkekOS8zbGLkvyzzbFHJHnqtvOu2Mx97xu57nlJ7jszv7D5bmckeWCScw9zfQDAMUC8AgD4tvOyFX2u/3lpklcl+d5s7aS6IMnb9nPea5L8QbZudbtdkucnyVrrU0l+Llt/7e+KbO3EelEO/f9gZ8zMl5NcneQtSa5M8vC11mf2M/b4JG
du1vnZbIWnX9kce8Pm9cqZ+dAhzp0k/yvJqZtrvjzJU9daV26O/VqS+yT52yT/KckfXX/SWuuazfjzN7dLPnr7RTfX+Jls7Wq7MskvJ/mZtdYXDmNtAMAxYr7zMQYAAAAA0MPOKwAAAABqiVcAAAAA1BKvAAAAAKglXgEAAABQS7wCAAAAoNaunV7A0eakk05au3fv3ullAAAAAHzXuPDCC7+w1rrr/o6JV4dp9+7d2bt3704vAwAAAOC7xsxcdmPH3DYIAAAAQC3xCgAAAIBa4hUAAAAAtcQrAAAAAGqJVwAAAADUEq8AAAAAqCVeAQAAAFBLvAIAAACglngFAAAAQC3xCgAAAIBa4hUAAAAAtcQrAAAAAGqJVwAAAADUEq8AAAAAqCVeAQAAAFBLvAIAAACglngFAAAAQC3xCgAAAIBa4hUAAAAAtcQrAAAAAGqJVwAAAADUEq8AAAAAqCVeAQAAAFBLvAIAAACglngFAAAAQC3xCgAAAIBa4hUAAAAAtcQrAAAAAGqJVwAAAADUEq8AAAAAqCVeAQAAAFBLvAIAAACglngFAAAAQC3xCgAAAIBa4hUAAAAAtcQrAAAAAGqJVwAAAADUEq8AAAAAqCVeAQAAAFBLvAIAAACglngFAAAAQC3xCgAAAIBa4hUAAAAAtcQrAAAAAGqJVwAAAADUEq8AAAAAqCVeAQAAAFBLvAIAAACglngFAAAAQC3xCgAAAIBa4hUAAAAAtcQrAAAAAGrt2ukFHG0uufzq7H7JW3d6GQAAALCjLj1zz04vgWOEnVcAAAAA1BKvAAAAAKglXgEAAABQS7wCAAAAoJZ4BQAAAEAt8QoAAACAWuIVAAAAALXEKwAAAABqiVcAAAAA1BKvAAAAAKglXgEAAABQS7wCAAAAoJZ4BQAAAEAt8QoAAACAWuIVAAAAALXEKwAAAABqiVcAAAAA1BKvAAAAAKglXgEAAABQS7wCAAAAoJZ4BQAAAEAt8QoAAACAWuIVAAAAALXEKwAAAABqiVcAAAAA1BKvAAAAAKglXgEAAABQS7wCAAAAoJZ4BQAAAEAt8QoAAACAWuIVAAAAALXEKwAAAABqiVcAAAAA1BKvAAAAAKglXgEAAABQS7wCAAAAoJZ4BQAAAEAt8QoAAACAWuIVAAAAALXEKwAAAABqiVcAAAAA1BKvAAAAAKglXgEAAABQS7wCAAAAoJZ4BQAAAEAt8QoAAACAWuIVAAAAALXEKwAAAABqiVcAAAAA1BKvAAAAAKh1VMWrmfnyTq8BAAAAgFvPURWvAAAAADi2HJXxara8Ymb+YmYumZkzNp+fMDN/NjMf2nz+c5vPd8/MX87Mf5uZj87M22fmezfHnj8zH5uZi2fmnJ38XgAAAAB8p107vYCb6OeTPCTJg5OclOSDM/PuJFckecpa64szc1KSC2bmLZtzTk3y9LXWc2bmj5P8kySvTfKSJP9grfX1mbnTrf5NAAAAALhRR+XOqySPTfK6tdZ1a63PJfnzJI9MMkl+Y2YuTvKnSU5J8v2bc/7vWuuizfsLk+zevL84ydkz88wk39jfZDPz3JnZOzN7r7vm6lvkCwEAAABwQ0drvLoxz0hy1yQPX2s9JMnnktxuc+zr28Zdl2/vOtuT5HeSPCxbO7husBttrXXWWusRa61HHHf7E2+xxQMAAADwnY7WePWeJGfMzHEzc9ckP57kA0lOTPL5tda1M/MTSe51oIvMzG2S3GOt9c4kL96cf8Itu3QAAAAADtXR+syr/57kR5N8JMlK8strrc/OzNlJ/sfMXJJkb5KPH+Q6xyV57cycmK1bDl+91rrqFlw3AAAAAIfhqIpXa60TNq8ryYs2P9uPfyFbUWt/fnjbuFdu+/yxR3iZAAAAABwhR+ttgwAAAAAcA8QrAAAAAGqJVwAAAADUEq8AAAAAqCVeAQAAAFBLvAIAAACglngFAAAAQC3xCgAAAIBa4hUAAAAAtcQrAAAAAGqJVwAAAA
DUEq8AAAAAqCVeAQAAAFBLvAIAAACglngFAAAAQC3xCgAAAIBa4hUAAAAAtcQrAAAAAGqJVwAAAADUEq8AAAAAqCVeAQAAAFBLvAIAAACglngFAAAAQC3xCgAAAIBa4hUAAAAAtcQrAAAAAGqJVwAAAADUEq8AAAAAqCVeAQAAAFBLvAIAAACglngFAAAAQC3xCgAAAIBa4hUAAAAAtcQrAAAAAGqJVwAAAADUEq8AAAAAqCVeAQAAAFBLvAIAAACglngFAAAAQC3xCgAAAIBa4hUAAAAAtcQrAAAAAGqJVwAAAADUEq8AAAAAqCVeAQAAAFBLvAIAAACglngFAAAAQC3xCgAAAIBa4hUAAAAAtXbt9AKONg865cTsPXPPTi8DAAAA4Jhg5xUAAAAAtcQrAAAAAGqJVwAAAADUEq8AAAAAqCVeAQAAAFBLvAIAAACglngFAAAAQC3xCgAAAIBa4hUAAAAAtcQrAAAAAGqJVwAAAADUEq8AAAAAqCVeAQAAAFBLvAIAAACglngFAAAAQC3xCgAAAIBa4hUAAAAAtcQrAAAAAGqJVwAAAADUEq8AAAAAqCVeAQAAAFBLvAIAAACglngFAAAAQC3xCgAAAIBa4hUAAAAAtcQrAAAAAGqJVwAAAADUEq8AAAAAqCVeAQAAAFBLvAIAAACglngFAAAAQC3xCgAAAIBa4hUAAAAAtcQrAAAAAGqJVwAAAADUEq8AAAAAqCVeAQAAAFBLvAIAAACglngFAAAAQC3xCgAAAIBa4hUAAAAAtcQrAAAAAGqJVwAAAADUEq8AAAAAqCVeAQAAAFBLvAIAAACglngFAAAAQC3xCgAAAIBa4hUAAAAAtcQrAAAAAGqJVwAAAADUEq8AAAAAqCVeAQAAAFBLvAIAAACglngFAAAAQC3xCgAAAIBa4hUAAAAAtcQrAAAAAGqJVwAAAADUEq8AAAAAqCVeAQAAAFBLvAIAAACglngFAAAAQC3xCgAAAIBa4hUAAAAAtcQrAAAAAGqJVwAAAADUEq8AAAAAqCVeAQAAAFBLvAIAAACglngFAAAAQC3xCgAAAIBa4hUAAAAAtcQrAAAAAGqJVwAAAADUEq8AAAAAqCVeAQAAAFBLvAIAAACglngFAAAAQC3xCgAAAIBa4hUAAAAAtcQrAAAAAGqJVwAAAADUEq8AAAAAqCVeAQAAAFBLvAIAAACglngFAAAAQC3xCgAAAIBa4hUAAAAAtcQrAAAAAGqJVwAAAADUEq8AAAAAqCVeAQAAAFBLvAIAAACglngFAAAAQC3xCgAAAIBa4hUAAAAAtcQrAAAAAGqJVwAAAADUEq8AAAAAqCVeAQAAAFBLvAIAAACglngFAAAAQC3xCgAAAIBa4hUAAAAAtcQrAAAAAGqJVwAAAADUEq8AAAAAqLVrpxdwtLnk8quz+yVv3ellAAAAAN9FLj1zz04voZadVwAAAADUEq8AAAAAqCVeAQAAAFBLvAIAAACglngFAAAAQC3xCgAAAIBa4hUAAAAAtcQrAAAAAGqJVwAAAADUEq8AAAAAqCVeAQAAAFBLvAIAAACglngFAAAAQC3xCgAAAIBa4hUAAAAAtcQrAAAAAGqJVwAAAADUEq8AAAAAqCVeAQAAAFBLvAIAAACglngFAAAAQC3xCgAAAIBa4hUAAAAAtcQrAAAAAGqJVwAAAADUEq8AAAAAqCVeAQAAAFBLvAIAAACglngFAAAAQC3xCgAAAIBa4hUAAAAAtcQrAAAAAGqJVwAAAADUEq8AAAAAqCVeAQAAAFBLvAIAAACglngFAAAAQC3xCgAAAIBa4hUAAAAAtcQrAAAAAGqJVwAAAADUEq8AAAAAqCVeAQAAAFBLvAIAAACglngFAAAAQC3xCgAAAIBa4hUAAAAAtcQrAAAAAGqJVwAAAADUOmi8mpn33RoLmZknz8wDv1vmAQAAAODmO2i8WmuddmssJMmTk9waUenWmgcAAACAm+lQdl59ef
N6+sy8a2beODMfn5mzZ8tPzcwbto0/fWbO3bx/4sy8f2Y+NDNvmJkTNp+fOTMfm5mLZ+aVM3Nakp9N8oqZuWhm7rOZ67/MzN6Z+cuZeeTMvGlm/npmXrZtvmfOzAc25/3uzBx3/bpn5uUz85GZuWBmvv9G5nn+trWccyR/uQAAAADcPIf7zKuHJnlBtnYu3TvJY5L8aZJHzcwdNmPOSHLOzJyU5FeTPGGt9bAke5O8cGbukuQpSX5orfUPk7xsrfW+JG9J8qK11kPWWp/cXOvv1lqPSPJfk7w5yb9N8sNJnj0zd5mZB2zme8xa6yFJrkvyjM25d0hywVrrwUneneQ5NzLPS5I8dLOWf3OYvw8AAAAAbkGHG68+sNb69Frrm0kuSrJ7rfWNJG9L8qSZ2ZVkT7ZC06OzFbnOn5mLkjwryb2SXJ3ka0l+b2Z+Psk1B5jvLZvXS5J8dK31/9ZaX0/yf5LcI8njkzw8yQc3czw+W1EtSf4uybmb9xcm2X0jc1yc5OyZeWaSb+xvwMw8d7MDbO9111x9gOUCAAAAcCTtOszxX9/2/rpt55+T5HlJ/ibJ3rXWl2ZmkrxjrfX0fS8yMz+SrdD01M15jzvIfN/cZ+5vbuaeJH+41vqV/Zx77Vpr7Wet+9qT5MeTPCnJf5iZB22C3Lestc5KclaSHH/yqeuGlwAAAADglnC4O69uzJ8neViS52QrZCXJBUkeMzM/mCQzc4eZue/muVcnrrXOS/KLSR68Gf+lJHc8zHn/LMlTZ+ZumznuPDP3Osg535pnZm6T5B5rrXcmeXGSE5OccJhrAAAAAOAWckTi1VrrumzdovfTm9esta5I8uwkr5uZi5O8P8n9sxWOzt189t4kL9xc5pwkL5qZD8/MfQ5x3o9l67lab99c7x1JTj7Iad+aJ8mpSV47M5ck+XCSV6+1rjq0bw0AAADALW2+fWcdh+L4k09dJz/rVTu9DAAAAOC7yKVn7tnpJeyomblw80f7buBI3TYIAAAAAEeceAUAAABALfEKAAAAgFriFQAAAAC1xCsAAAAAaolXAAAAANQSrwAAAACoJV4BAAAAUEu8AgAAAKCWeAUAAABALfEKAAAAgFriFQAAAAC1xCsAAAAAaolXAAAAANQSrwAAAACoJV4BAAAAUEu8AgAAAKCWeAUAAABALfEKAAAAgFriFQAAAAC1xCsAAAAAaolXAAAAANQSrwAAAACoJV4BAAAAUEu8AgAAAKCWeAUAAABALfEKAAAAgFriFQAAAAC1xCsAAAAAaolXAAAAANQSrwAAAACoJV4BAAAAUEu8AgAAAKCWeAUAAABALfEKAAAAgFriFQAAAAC1xCsAAAAAaolXAAAAANQSrwAAAACoJV4BAAAAUEu8AgAAAKCWeAUAAABALfEKAAAAgFriFQAAAAC1xCsAAAAAaolXAAAAANQSrwAAAACoJV4BAAAAUEu8AgAAAKCWeAUAAABArV07vYCjzYNOOTF7z9yz08sAAAAAOCbYeQUAAABALfEKAAAAgFriFQAAAAC1xCsAAAAAaolXAAAAANQSrwAAAACoJV4BAAAAUEu8AgAAAKCWeAUAAABALfEKAAAAgFriFQAAAAC1xCsAAAAAaolXAAAAANQSrwAAAACoJV4BAAAAUEu8AgAAAKCWeAUAAABALfEKAAAAgFriFQAAAAC1xCsAAAAAaolXAAAAANQSrwAAAACoJV4BAAAAUEu8AgAAAKCWeAUAAABALfEKAAAAgFriFQAAAAC1xCsAAAAAaolXAAAAANQSrwAAAACoJV4BAAAAUEu8AgAAAKCWeAUAAABALfEKAAAAgFriFQAAAAC1xCsAAAAAaolXAAAAANQSrwAAAACoJV4BAAAAUEu8AgAAAKCWeAUAAABALfEKAAAAgFriFQAAAAC1xCsAAAAAaolXAAAAANQSrwAAAACoJV4BAAAAUEu8AgAAAKCWeAUAAABALfEKAAAAgFriFQAAAAC1xC
sAAAAAaolXAAAAANQSrwAAAACoJV4BAAAAUEu8AgAAAKCWeAUAAABALfEKAAAAgFriFQAAAAC1xCsAAAAAaolXAAAAANQSrwAAAACoJV4BAAAAUEu8AgAAAKCWeAUAAABALfEKAAAAgFriFQAAAAC1xCsAAAAAaolXAAAAANQSrwAAAACoJV4BAAAAUEu8AgAAAKCWeAUAAABALfEKAAAAgFriFQAAAAC1xCsAAAAAaolXAAAAANQSrwAAAACoJV4BAAAAUEu8AgAAAKCWeAUAAABALfEKAAAAgFriFQAAAAC1xCsAAAAAaolXAAAAANQSrwAAAACoJV4BAAAAUEu8AgAAAKCWeAUAAABALfEKAAAAgFriFQAAAAC1xCsAAAAAaolXAAAAANQSrwAAAACoJV4BAAAAUEu8AgAAAKCWeAUAAABALfEKAAAAgFriFQAAAAC1xCsAAAAAaolXAAAAANQSrwAAAACoJV4BAAAAUEu8AgAAAKCWeAUAAABALfEKAAAAgFriFQAAAAC1du30Ao42l1x+dXa/5K07vQwAAADgGHXpmXt2egm3KjuvAAAAAKglXgEAAABQS7wCAAAAoJZ4BQAAAEAt8QoAAACAWuIVAAAAALXEKwAAAABqiVcAAAAA1BKvAAAAAKglXgEAAABQS7wCAAAAoJZ4BQAAAEAt8QoAAACAWuIVAAAAALXEKwAAAABqiVcAAAAA1BKvAAAAAKglXgEAAABQS7wCAAAAoJZ4BQAAAEAt8QoAAACAWuIVAAAAALXEKwAAAABqiVcAAAAA1BKvAAAAAKglXgEAAABQS7wCAAAAoJZ4BQAAAEAt8QoAAACAWuIVAAAAALXEKwAAAABqiVcAAAAA1BKvAAAAAKglXgEAAABQS7wCAAAAoJZ4BQAAAEAt8QoAAACAWuIVAAAAALXEKwAAAABqiVcAAAAA1BKvAAAAAKglXgEAAABQS7wCAAAAoJZ4BQAAAEAt8QoAAACAWuIVAAAAALXEKwAAAABqiVcAAAAA1BKvAAAAAKglXgEAAABQa0fj1cy87xDGvGBmbn8Lr+PJM/PAW3IOAAAAAA7fjsartdZphzDsBUkOK17NzHGHuZQnJxGvAAAAAMrs9M6rL29eT5+Zd83MG2fm4zNz9mx5fpK7J3nnzLxzM/aJM/P+mfnQzLxhZk7YfH7pzPzmzHwoydMOMO7MmfnYzFw8M6+cmdOS/GySV8zMRTNznx35ZQAAAABwA7t2egHbPDTJDyX5TJLzkzxmrfXqmXlhkp9Ya31hZk5K8qtJnrDW+srMvDjJC5P8+uYaV661HrYZ96Z9x83M7yR5SpL7r7XWzNxprXXVzLwlyblrrTfeul8ZAAAAgANpilcfWGt9Oklm5qIku5O8d58xj87W7X3nz0ySfE+S9287/vqDjLs6ydeS/N7MnJvk3ENZ2Mw8N8lzk+S477vrYX4tAAAAAG6qpnj19W3vr8v+1zZJ3rHWevqNXOMrBxs3Mz+S5PFJnprkeUked7CFrbXOSnJWkhx/8qnrYOMBAAAAODJ29JlXh+hLSe64eX9BksfMzA8myczcYWbuu59z9jtu89yrE9da5yX5xSQP3s8cAAAAAJQ4GuLVWUneNjPvXGtdkeTZSV43Mxdn61bA++97wgHG3THJuZvP3put52UlyTlJXjQzH/bAdgAAAIAes5a74A7H8Sefuk5+1qt2ehkAAADAMerSM/fs9BKOuJm5cK31iP0dOxp2XgEAAABwjBKvAAAAAKglXgEAAABQS7wCAAAAoJZ4BQAAAEAt8QoAAACAWuIVAAAAALXEKwAAAABqiVcAAAAA1BKvAAAAAKglXgEAAABQS7wCAAAAoJZ4BQAAAEAt8QoAAACAWuIVAAAAALXEKwAAAABqiVcAAAAA1BKvAAAAAKglXgEAAABQS7wCAAAAoJZ4BQAAAEAt8QoAAACAWuIVAAAAALXEKwAAAABqiVcAAAAA1BKvAAAAAKglXg
EAAABQS7wCAAAAoJZ4BQAAAEAt8QoAAACAWuIVAAAAALXEKwAAAABqiVcAAAAA1BKvAAAAAKglXgEAAABQS7wCAAAAoJZ4BQAAAEAt8QoAAACAWuIVAAAAALXEKwAAAABqiVcAAAAA1BKvAAAAAKglXgEAAABQS7wCAAAAoJZ4BQAAAEAt8QoAAACAWuIVAAAAALXEKwAAAABqiVcAAAAA1Nq10ws42jzolBOz98w9O70MAAAAgGOCnVcAAAAA1BKvAAAAAKglXgEAAABQS7wCAAAAoJZ4BQAAAEAt8QoAAACAWuIVAAAAALXEKwAAAABqiVcAAAAA1BKvAAAAAKglXgEAAABQS7wCAAAAoJZ4BQAAAEAt8QoAAACAWuIVAAAAALXEKwAAAABqiVcAAAAA1BKvAAAAAKglXgEAAABQS7wCAAAAoJZ4BQAAAEAt8QoAAACAWuIVAAAAALXEKwAAAABqiVcAAAAA1BKvAAAAAKglXgEAAABQS7wCAAAAoJZ4BQAAAEAt8QoAAACAWuIVAAAAALXEKwAAAABqiVcAAAAA1BKvAAAAAKglXgEAAABQS7wCAAAAoJZ4BQAAAEAt8QoAAACAWuIVAAAAALXEKwAAAABqiVcAAAAA1BKvAAAAAKglXgEAAABQS7wCAAAAoJZ4BQAAAEAt8QoAAACAWuIVAAAAALXEKwAAAABqiVcAAAAA1BKvAAAAAKglXgEAAABQS7wCAAAAoJZ4BQAAAEAt8QoAAACAWuIVAAAAALXEKwAAAABqiVcAAAAA1BKvAAAAAKglXgEAAABQS7wCAAAAoJZ4BQAAAEAt8QoAAACAWuIVAAAAALXEKwAAAABqiVcAAAAA1BKvAAAAAKglXgEAAABQS7wCAAAAoJZ4BQAAAEAt8QoAAACAWuIVAAAAALXEKwAAAABqiVcAAAAA1BKvAAAAAKglXgEAAABQS7wCAAAAoJZ4BQAAAEAt8QoAAACAWuIVAAAAALXEKwAAAABqiVcA/P/27j3WsrOs4/jvcaZQW+KUSkOwUAqxqWlAoWkQvDZAKtJKSWgUlVgKpn9I5BIJVk0katBqjCACTRBabg1oCqmIRlLaxgsqlFrT29BAAIGmULAXCSZg4+Mfew0chzPTM5fOfqstFbwAAA2ISURBVNrz+SST2WvttbrfPcnb98x31tobAABgLPEKAAAAgLHEKwAAAADGEq8AAAAAGEu8AgAAAGAs8QoAAACAscQrAAAAAMYSrwAAAAAYS7wCAAAAYCzxCgAAAICxxCsAAAAAxhKvAAAAABhLvAIAAABgLPEKAAAAgLHEKwAAAADGEq8AAAAAGEu8AgAAAGAs8QoAAACAscQrAAAAAMYSrwAAAAAYS7wCAAAAYCzxCgAAAICxxCsAAAAAxhKvAAAAABhLvAIAAABgLPEKAAAAgLHEKwAAAADGEq8AAAAAGGvnugfwYHPT7ffm5Iv+Zt3DAAAAALapz1189rqHcES58goAAACAscQrAAAAAMYSrwAAAAAYS7wCAAAAYCzxCgAAAICxxCsAAAAAxhKvAAAAABhLvAIAAABgLPEKAAAAgLHEKwAAAADGEq8AAAAAGEu8AgAAAGAs8QoAAACAscQrAAAAAMYSrwAAAAAYS7wCAAAAYCzxCgAAAICxxCsAAAAAxhKvAAAAABhLvAIAAABgLPEKAAAAgLHEKwAAAADGEq8AAAAAGEu8AgAAAGAs8QoAAACAscQrAAAAAMYSrwAAAAAYS7wCAAAAYCzxCgAAAICxxCsAAAAAxhKvAAAAABhLvAIAAABgLPEKAAAAgLHEKwAAAADGEq8AAAAAGEu8AgAAAGAs8QoAAACAscQrAAAAAMYSrwAAAAAYS7wCAAAAYCzxCgAAAICxxCsAAAAAxhKvAAAAABhLvAIAAABgLPEKAAAAgLHEKwAAAADGEq8AAAAAGEu8AgAAAGAs8QoAAACAsQ44XlXVyVV18wMxmENRVWdW1YfWPQ4AAA
AADp9te+VVVe1c9xgAAAAA2L+DjVc7q+ryqtpdVVdU1TFV9dtVdV1V3VxVb62qSpKqenlV3VpVN1bV+5Z9x1bVpVX18aq6oarO3exF9nHu06rqX5bz/rmqTt3kvE2PqaoXV9UHq+qaJFdX1buq6vkbzrt8X2MBAAAA4Mg72KuPTk3y0u7+aFVdmuRXkrypu383Sarq3UnOSfLXSS5K8oTu/kZVHbec/1tJrunulyz7Pl5VH+nur+/1Opud+8kkP97d91XVs5P8fpIX7HXe/o45PckPdvddVfWTSV6V5Mqq2pXkR5Kcf5B/JgAAAAAcZgcbr77Q3R9dHr8nycuTfLaqXpPkmCTHJ7klq3h1Y5LLq+rKJFcu55yV5HlV9epl++gkJyXZvdfrbHburiTvrKpTknSSozYZ3/6Ouaq770qS7v77qnpLVZ2QVdx6f3fft/d/rKouTHJhkuz4nhP2/ycDAAAAwGFzsLcN9ibbb0lyXnc/OcmfZxWkkuTsJG/O6oqn65bPmqokL+jupyy/Turu3VV1WVX9e1X97X7O/b0k13b3k5L8zIbX2Wh/x+x9dde7krwoyQVJLt30zXa/tbvP6O4zdhyza39/LgAAAAAcRgcbr06qqmcsj38hyT8tj79aVY9Icl6SVNV3JXlcd1+b5NezuiLqEUk+nORXN3wu1lOTpLsvWGLWc/dz7q4kty+v9+J9jG8rx+zxjiSvXF7/1vt95wAAAAAcMQcbr25L8rKq2p3kkUkuyepqq5uzClPXLcftSPKeqropyQ1J3tjd92R1ZdRRSW6sqluW7b3t69w/SvIHVXVD9n3b41aOSZJ095ezul3xsi29cwAAAACOmOre+w7A7aWqjklyU5LTu/ve+zv+4Y85pR9z/hse+IEBAAAAbOJzF5+97iEcdlV1fXefsdlzB3vl1UPC8k2Eu5P82VbCFQAAAABH1sF+2+BDQnd/JMnj1z0OAAAAADa3ra+8AgAAAGA28QoAAACAscQrAAAAAMYSrwAAAAAYS7wCAAAAYCzxCgAAAICxxCsAAAAAxhKvAAAAABhLvAIAAABgLPEKAAAAgLHEKwAAAADGEq8AAAAAGEu8AgAAAGAs8QoAAACAscQrAAAAAMYSrwAAAAAYS7wCAAAAYCzxCgAAAICxxCsAAAAAxhKvAAAAABhLvAIAAABgLPEKAAAAgLHEKwAAAADGEq8AAAAAGEu8AgAAAGAs8QoAAACAscQrAAAAAMYSrwAAAAAYS7wCAAAAYCzxCgAAAICxxCsAAAAAxhKvAAAAABhLvAIAAABgLPEKAAAAgLHEKwAAAADGEq8AAAAAGEu8AgAAAGAs8QoAAACAscQrAAAAAMYSrwAAAAAYS7wCAAAAYCzxCgAAAICxxCsAAAAAxhKvAAAAABhLvAIAAABgLPEKAAAAgLHEKwAAAADG2rnuATzYPPnEXfnExWevexgAAAAA24IrrwAAAAAYS7wCAAAAYCzxCgAAAICxxCsAAAAAxhKvAAAAABhLvAIAAABgLPEKAAAAgLHEKwAAAADGEq8AAAAAGEu8AgAAAGAs8QoAAACAscQrAAAAAMYSrwAAAAAYS7wCAAAAYCzxCgAAAICxxCsAAAAAxhKvAAAAABhLvAIAAABgLPEKAAAAgLHEKwAAAADGEq8AAAAAGEu8AgAAAGAs8QoAAACAscQrAAAAAMYSrwAAAAAYS7wCAAAAYCzxCgAAAICxxCsAAAAAxhKvAAAAABhLvAIAAABgLPEKAAAAgLHEKwAAAADGEq8AAAAAGEu8AgAAAGAs8QoAAACAscQrAAAAAMYSrwAAAAAYS7wCAAAAYCzxCgAAAICxxCsAAAAAxhKvAAAAABhLvAIAAABgLPEKAAAAgLHEKwAAAADGEq8AAAAAGEu8AgAAAGAs8QoAAACAscQrAAAAAMYSrwAAAAAYS7wCAAAAYCzxCgAAAICxxCsAAAAAxhKvAAAAABhLvAIAAABgrOrudY/hQa
WqvpbktnWPAx7kHpXkq+seBDwEmEtw6MwjODzMJTh0230ePb67T9jsiZ1HeiQPAbd19xnrHgQ8mFXVJ8wjOHTmEhw68wgOD3MJDp15tG9uGwQAAABgLPEKAAAAgLHEqwP31nUPAB4CzCM4PMwlOHTmERwe5hIcOvNoH3xgOwAAAABjufIKAAAAgLHEqy2qqudU1W1V9emqumjd44HJqupxVXVtVd1aVbdU1SuW/cdX1VVV9anl90cu+6uq3rjMrxur6vT1vgOYo6p2VNUNVfWhZfsJVfWxZb78RVU9bNn/8GX708vzJ69z3DBJVR1XVVdU1SerandVPcOaBAemql61/Fx3c1W9t6qOtibB/auqS6vqzqq6ecO+A16Dqur85fhPVdX563gv6yRebUFV7Ujy5iQ/neS0JD9fVaetd1Qw2n1Jfq27T0vy9CQvW+bMRUmu7u5Tkly9bCeruXXK8uvCJJcc+SHDWK9IsnvD9h8meX13f3+Su5O8dNn/0iR3L/tfvxwHrPxpkr/r7h9I8kNZzSlrEmxRVZ2Y5OVJzujuJyXZkeSFsSbBVrwjyXP22ndAa1BVHZ/ktUl+OMnTkrx2T/DaLsSrrXlakk9392e6+5tJ3pfk3DWPCcbq7ju6+9+Wx1/L6i8JJ2Y1b965HPbOJM9fHp+b5F298q9JjquqxxzhYcM4VfXYJGcneduyXUmemeSK5ZC959Ge+XVFkmctx8O2VlW7kvxEkrcnSXd/s7vviTUJDtTOJN9dVTuTHJPkjliT4H519z8kuWuv3Qe6Bv1Ukqu6+67uvjvJVfnOIPaQJl5tzYlJvrBh+4vLPuB+LJeJPzXJx5I8urvvWJ76UpJHL4/NMdjcG5K8Jsn/Ltvfm+Se7r5v2d44V741j5bn712Oh+3uCUm+kuSy5Rbct1XVsbEmwZZ19+1J/jjJ57OKVvcmuT7WJDhYB7oGbfu1SbwCHjBV9Ygk70/yyu7+r43P9eqrTn3dKexDVZ2T5M7uvn7dY4EHuZ1JTk9ySXc/NcnX8+3bM5JYk+D+LLcnnZtVDP6+JMdmm131AQ8Ua9DWiFdbc3uSx23YfuyyD9iHqjoqq3B1eXd/YNn95T23Xiy/37nsN8fgO/1okudV1eeyul39mVl9bs9xyy0byf+fK9+aR8vzu5L855EcMAz1xSRf7O6PLdtXZBWzrEmwdc9O8tnu/kp3/0+SD2S1TlmT4OAc6Bq07dcm8WprrktyyvJtGg/L6sMJP7jmMcFYy2cavD3J7u7+kw1PfTDJnm/GOD/JX23Y/0vLt2s8Pcm9Gy6jhW2pu3+jux/b3Sdnte5c092/mOTaJOcth+09j/bMr/OW4/0rHtted38pyReq6tRl17OS3BprEhyIzyd5elUds/yct2ceWZPg4BzoGvThJGdV1SOXKyHPWvZtG+X/IVtTVc/N6rNHdiS5tLtft+YhwVhV9WNJ/jHJTfn2Z/X8Zlafe/WXSU5K8h9Jfra771p+CHpTVpef/3eSC7r7E0d84DBUVZ2Z5NXdfU5VPTGrK7GOT3JDkhd19zeq6ugk787qM+buSvLC7v7MusYMk1TVU7L64oOHJflMkguy+kdcaxJsUVX9TpKfy+pbpW9I8stZfeaONQn2o6rem+TMJI9K8uWsvjXwyhzgGlRVL8nq71RJ8rruvuxIvo91E68AAAAAGMttgwAAAACMJV4BAAAAMJZ4BQAAAMBY4hUAAAAAY4lXAAAAAIwlXgEAAAAwlngFAAAAwFjiFQAAAABj/R8xzbP1bLaODgAAAABJRU5ErkJggg==\n"
          },
          "metadata": {
            "needs_background": "light"
          }
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "sep = '_'*50\n",
        "num_aug_per_label=20\n",
        "n_per_class_total = 200\n",
        "epochs=5\n",
        "gpt2_pipe = nlu.load('gpt2')\n",
        "gpt2_pipe['gpt2'].setMaxOutputLength(30)\n",
        "prompt_label_prefix=' contract clause summary:'\n",
        "\n",
        "no_aug_fitted_classifier, aug_fitted_classifier = compare_vanilla_and_augmented_training(\n",
        "  base_dataset=shrink_train_df,\n",
        "  generator_model=gpt2_pipe,\n",
        "  embeddings_to_use=embeddings_to_use,\n",
        "  n_per_class_total=n_per_class_total,\n",
        "  train_test_frac=train_test_frac,\n",
        "  label_col=label_col,\n",
        "  prompt_label_prefix=prompt_label_prefix,\n",
        "  slice_len=slice_len,\n",
        "  num_aug_per_label=num_aug_per_label,\n",
        "  epochs=epochs,\n",
        "  learn_rate=learn_rate,\n",
        "  batch_size=batch_size,\n",
        "  droput=droput)\n",
        "# Strong improvement in some classes, great!"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "7VKrxdWIxzYo",
        "outputId": "aba502e4-204d-4cb9-e9cf-6003c42bd066",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "gpt2 download started this may take some time.\n",
            "Approximate size to download 442.7 MB\n",
            "[OK!]\n",
            "__________________________________________________ Starting Data Augmentation Experiment __________________________________________________\n",
            "Train Dataset Size = 80\n",
            "Test Dataset Size = 720\n",
            "Training Vanilla Classifier\n",
            "sent_small_bert_L2_128 download started this may take some time.\n",
            "Approximate size to download 16.1 MB\n",
            "[OK!]\n",
            "Training Settings : Epochs=5, learn_rate=0.0005, batch_size=32, dropout=0.5\n",
            "sentence_detector_dl download started this may take some time.\n",
            "Approximate size to download 354.6 KB\n",
            "[OK!]\n",
            "__________________________________________________ Metrics on Train datataset with Vanilla Model __________________________________________________\n",
            "              precision    recall  f1-score   support\n",
            "\n",
            " base-salary       0.45      0.99      0.62       180\n",
            "    interest       1.00      0.01      0.02       180\n",
            " investments       0.52      0.93      0.67       180\n",
            "       loans       1.00      0.01      0.02       180\n",
            "\n",
            "    accuracy                           0.49       720\n",
            "   macro avg       0.74      0.49      0.33       720\n",
            "weighted avg       0.74      0.49      0.33       720\n",
            "\n",
            "__________________________________________________ Metrics on Test Dataset with Vanilla Model __________________________________________________\n",
            "              precision    recall  f1-score   support\n",
            "\n",
            " base-salary       0.48      1.00      0.65        55\n",
            "    interest       1.00      0.02      0.04        44\n",
            " investments       0.32      0.66      0.43        29\n",
            "       loans       1.00      0.02      0.04        47\n",
            "\n",
            "    accuracy                           0.43       175\n",
            "   macro avg       0.70      0.42      0.29       175\n",
            "weighted avg       0.72      0.43      0.30       175\n",
            "\n",
            "____________________________________________________________________________________________________\n",
            "__________________________________________________ Generating for label=interest No. 1/4 __________________________________________________\n",
            "__________________________________________________ Generating new training data example 1/20 for prompt : <interest contract clause summary: Interest on borrowings;> __________________________________________________\n",
            "Generated : \n",
            "  interest contract clause summary: Interest on borrowings; interest on debt; interest in securities; interest paid on securities; and interest on securities.\n",
            "\n",
            "The\n",
            "__________________________________________________ Generating new training data example 2/20 for prompt : <interest contract clause summary: 1. Interest arising in a Cont> __________________________________________________\n",
            "Generated : \n",
            "  interest contract clause summary: 1. Interest arising in a Contribution to the Fund shall be paid to the Trustee, and the Trust shall be entitled to\n",
            "__________________________________________________ Generating new training data example 3/20 for prompt : <interest contract clause summary: (a) The Loans comprising each> __________________________________________________\n",
            "Generated : \n",
            "  interest contract clause summary: (a) The Loans comprising each of the Loans shall be deemed to be a single, limited liability company, and shall be subject\n",
            "__________________________________________________ Generating new training data example 4/20 for prompt : <interest contract clause summary: (a) The Loans comprising each> __________________________________________________\n",
            "Generated : \n",
            "  interest contract clause summary: (a) The Loans comprising each of the Loans shall be deemed to be a single, limited liability company, and shall be subject\n",
            "__________________________________________________ Generating new training data example 5/20 for prompt : <interest contract clause summary: 1. Interest arising in one of> __________________________________________________\n",
            "Generated : \n",
            "  interest contract clause summary: 1. Interest arising in one of the following: (a) a contract for the purchase of a motor vehicle or a motor boat\n",
            "__________________________________________________ Generating new training data example 6/20 for prompt : <interest contract clause summary: The Borrower shall pay intere> __________________________________________________\n",
            "Generated : \n",
            "  interest contract clause summary: The Borrower shall pay interexample interest on the amount of the loan. The BOR shall pay interest on any\n",
            "__________________________________________________ Generating new training data example 7/20 for prompt : <interest contract clause summary: (a) The unpaid principal amou> __________________________________________________\n",
            "Generated : \n",
            "  interest contract clause summary: (a) The unpaid principal amoung the principal amount of the principal balance of the partnership; (b) The principal amount\n",
            "__________________________________________________ Generating new training data example 8/20 for prompt : <interest contract clause summary: (a) Each Loan shall bear inte> __________________________________________________\n",
            "Generated : \n",
            "  interest contract clause summary: (a) Each Loan shall bear interesse of the principal amount of the loan, and (b) Each loan shall\n",
            "__________________________________________________ Generating new training data example 9/20 for prompt : <interest contract clause summary: Borrower hereby promises to e> __________________________________________________\n",
            "Generated : \n",
            "  interest contract clause summary: Borrower hereby promises to e-mail the following to the following address:\n",
            "\n",
            "Borrower\n",
            "\n",
            "P.O\n",
            "__________________________________________________ Generating new training data example 10/20 for prompt : <interest contract clause summary: The outstanding principal bal> __________________________________________________\n",
            "Generated : \n",
            "  interest contract clause summary: The outstanding principal balancer balance is $1.5 million. The principal balance is due on the date of the sale of the\n",
            "__________________________________________________ Generating new training data example 11/20 for prompt : <interest contract clause summary: If the Company does not pay a> __________________________________________________\n",
            "Generated : \n",
            "  interest contract clause summary: If the Company does not pay a reasonable amount of the Company's reasonable costs, the Company may not be required to pay any reasonable\n",
            "__________________________________________________ Generating new training data example 12/20 for prompt : <interest contract clause summary: Interest accrued on the Revol> __________________________________________________\n",
            "Generated : \n",
            "  interest contract clause summary: Interest accrued on the Revolv contract is not included in the Revolver's total interest in the contract.\n",
            "\n",
            "The Rev\n",
            "__________________________________________________ Generating new training data example 13/20 for prompt : <interest contract clause summary: (a) LEVEL 3 FINANCING, INC., > __________________________________________________\n",
            "Generated : \n",
            "  interest contract clause summary: (a) LEVEL 3 FINANCING, INC., a wholly-owned subsidiary of the Company, is a wholly owned subsidiary of\n",
            "__________________________________________________ Generating new training data example 14/20 for prompt : <interest contract clause summary: (a) Subject to the provisions> __________________________________________________\n",
            "Generated : \n",
            "  interest contract clause summary: (a) Subject to the provisions of subsection (2), the Secretary of State shall provide for the payment of the cost of the\n",
            "__________________________________________________ Generating new training data example 15/20 for prompt : <interest contract clause summary: Simple interest shall accrue > __________________________________________________\n",
            "Generated : \n",
            "  interest contract clause summary: Simple interest shall accrue to the principal of the principal's principal residence in the United States.\n",
            "\n",
            "(b) The principal\n",
            "__________________________________________________ Generating new training data example 16/20 for prompt : <interest contract clause summary: (a) Each Note will bear inter> __________________________________________________\n",
            "Generated : \n",
            "  interest contract clause summary: (a) Each Note will bear inter alia the following: (i) The date of the commencement of the contract; (\n",
            "__________________________________________________ Generating new training data example 17/20 for prompt : <interest contract clause summary: This Security shall not bear > __________________________________________________\n",
            "Generated : \n",
            "  interest contract clause summary: This Security shall not bear any relation to any other security.\n",
            "\n",
            "(2) The Security shall be deemed to be a security\n",
            "__________________________________________________ Generating new training data example 18/20 for prompt : <interest contract clause summary: SILGAN HOLDINGS INC., a Delaw> __________________________________________________\n",
            "Generated : \n",
            "  interest contract clause summary: SILGAN HOLDINGS INC., a Delawares-based company, is a Delaware corporation. The company is a subsidiary of\n",
            "__________________________________________________ Generating new training data example 19/20 for prompt : <interest contract clause summary: Interest on Advances shall be> __________________________________________________\n",
            "Generated : \n",
            "  interest contract clause summary: Interest on Advances shall be paid to the principal of the Company, and shall be payable to the Principal of the company, and\n",
            "__________________________________________________ Generating new training data example 20/20 for prompt : <interest contract clause summary: L-3 Communications Corporatio> __________________________________________________\n",
            "Generated : \n",
            "  interest contract clause summary: L-3 Communications Corporatio- tionale, Inc. (L-3) (L.C.) (L.-\n",
            "__________________________________________________ Generating for label=loans No. 2/4 __________________________________________________\n",
            "__________________________________________________ Generating new training data example 1/20 for prompt : <loans contract clause summary: (a) Amount of new Loan: $____> __________________________________________________\n",
            "Generated : \n",
            "  loans contract clause summary: (a) Amount of new Loan: $____$ $____ $____ (b) Amounts of new loan: $__\n",
            "__________________________________________________ Generating new training data example 2/20 for prompt : <loans contract clause summary: Book Value (e) credit card bu> __________________________________________________\n",
            "Generated : \n",
            "  loans contract clause summary: Book Value (e) credit card buxom loan contract clause Summary: Book value (e)(1) credit cards bux\n",
            "__________________________________________________ Generating new training data example 3/20 for prompt : <loans contract clause summary: Revolving credit loans made o> __________________________________________________\n",
            "Generated : \n",
            "  loans contract clause summary: Revolving credit loans made o/o the Federal Reserve System.\n",
            "\n",
            "The following is a summary of the terms of the contract\n",
            "__________________________________________________ Generating new training data example 4/20 for prompt : <loans contract clause summary: Except as otherwise agreed be> __________________________________________________\n",
            "Generated : \n",
            "  loans contract clause summary: Except as otherwise agreed be the terms of the contract, the terms and conditions of the agreement shall be as follows:\n",
            "\n",
            "1\n",
            "__________________________________________________ Generating new training data example 5/20 for prompt : <loans contract clause summary: Make any loans or other advan> __________________________________________________\n",
            "Generated : \n",
            "  loans contract clause summary: Make any loans or other advances that are not in the contract.\n",
            "\n",
            "Make any loans that are in the contracts.\n",
            "__________________________________________________ Generating new training data example 6/20 for prompt : <loans contract clause summary: Any Partner may loan funds to> __________________________________________________\n",
            "Generated : \n",
            "  loans contract clause summary: Any Partner may loan funds to a Partner for the purpose of providing services to the Partner.\n",
            "\n",
            "Any Partner may lend funds to\n",
            "__________________________________________________ Generating new training data example 7/20 for prompt : <loans contract clause summary: GC will make loans to Borrowe> __________________________________________________\n",
            "Generated : \n",
            "  loans contract clause summary: GC will make loans to Borrowe, but will not make loans for the rest of the season.\n",
            "\n",
            "\"We are\n",
            "__________________________________________________ Generating new training data example 8/20 for prompt : <loans contract clause summary: Subject to the terms and cond> __________________________________________________\n",
            "Generated : \n",
            "  loans contract clause summary: Subject to the terms and condenses of the contract, the Company shall pay the Company $1,000,000 for each of\n",
            "__________________________________________________ Generating new training data example 9/20 for prompt : <loans contract clause summary: No loan shall be contracted o> __________________________________________________\n",
            "Generated : \n",
            "  loans contract clause summary: No loan shall be contracted o the United States Government for the purpose of providing for the payment of any loan or other obligation of the\n",
            "__________________________________________________ Generating new training data example 10/20 for prompt : <loans contract clause summary: Make any loans or other advan> __________________________________________________\n",
            "Generated : \n",
            "  loans contract clause summary: Make any loans or other advances that are not in the contract.\n",
            "\n",
            "Make any loans that are in the contracts.\n",
            "__________________________________________________ Generating new training data example 11/20 for prompt : <loans contract clause summary: Coast will make loans to Borr> __________________________________________________\n",
            "Generated : \n",
            "  loans contract clause summary: Coast will make loans to Borr, but will not make loans for the other players.\n",
            "\n",
            "The contract will be renewed for\n",
            "__________________________________________________ Generating new training data example 12/20 for prompt : <loans contract clause summary: The Plan Administrator may, i> __________________________________________________\n",
            "Generated : \n",
            "  loans contract clause summary: The Plan Administrator may, i.e., the Plan Administrator shall, by rule, require the Plan administrator to provide a summary of\n",
            "__________________________________________________ Generating new training data example 13/20 for prompt : <loans contract clause summary: Silicon will make loans to Bo> __________________________________________________\n",
            "Generated : \n",
            "  loans contract clause summary: Silicon will make loans to BoE and other financial institutions to help them meet their debt obligations.\n",
            "\n",
            "The company will also provide\n",
            "__________________________________________________ Generating new training data example 14/20 for prompt : <loans contract clause summary: Except as disclosed or provid> __________________________________________________\n",
            "Generated : \n",
            "  loans contract clause summary: Except as disclosed or providenced by the terms of the contract, the Company shall not be liable for any loss or damage arising\n",
            "__________________________________________________ Generating new training data example 15/20 for prompt : <loans contract clause summary: 3.1. On each Loan Subscriptio> __________________________________________________\n",
            "Generated : \n",
            "  loans contract clause summary: 3.1. On each Loan Subscriptio, the Company shall provide to the Loan Subscription Company a summary of the terms\n",
            "__________________________________________________ Generating new training data example 16/20 for prompt : <loans contract clause summary: Silicon will make one or more> __________________________________________________\n",
            "Generated : \n",
            "  loans contract clause summary: Silicon will make one or more of the following:\n",
            "\n",
            "(a) a computer system that is capable of performing a computation that\n",
            "__________________________________________________ Generating new training data example 17/20 for prompt : <loans contract clause summary: Make advances, loans or exten> __________________________________________________\n",
            "Generated : \n",
            "  loans contract clause summary: Make advances, loans or extenuating circumstances.\n",
            "\n",
            "Make advances, loan or extensuating circumstances\n",
            "\n",
            "If you are\n",
            "__________________________________________________ Generating new training data example 18/20 for prompt : <loans contract clause summary: Each Loan, each payment or pr> __________________________________________________\n",
            "Generated : \n",
            "  loans contract clause summary: Each Loan, each payment or proration of a loan, is a contract for the payment of a specified amount of money or money\n",
            "__________________________________________________ Generating new training data example 19/20 for prompt : <loans contract clause summary: Subject to the terms and cond> __________________________________________________\n",
            "Generated : \n",
            "  loans contract clause summary: Subject to the terms and condenses of the contract, the Company shall pay the Company $1,000,000 for each of\n",
            "__________________________________________________ Generating new training data example 20/20 for prompt : <loans contract clause summary: (a) Each (i) Revolving A Loan> __________________________________________________\n",
            "Generated : \n",
            "  loans contract clause summary: (a) Each (i) Revolving A Loan Contract (or any other Revolving Loan Contract) shall be deemed to be\n",
            "__________________________________________________ Generating for label=investments No. 3/4 __________________________________________________\n",
            "__________________________________________________ Generating new training data example 1/20 for prompt : <investments contract clause summary: Make or own any Investments, > __________________________________________________\n",
            "Generated : \n",
            "  investments contract clause summary: Make or own any Investments, or any other Investment, or make any Other Investment, in any State or Territory, or in any\n",
            "__________________________________________________ Generating new training data example 2/20 for prompt : <investments contract clause summary: Make or hold any Investments,> __________________________________________________\n",
            "Generated : \n",
            "  investments contract clause summary: Make or hold any Investments, including any of the following: (1) Any of the assets of the Company, including the Company\n",
            "__________________________________________________ Generating new training data example 3/20 for prompt : <investments contract clause summary: All Investments of each of th> __________________________________________________\n",
            "Generated : \n",
            "  investments contract clause summary: All Investments of each of th e Company's subsidiaries, and all of its affiliates, shall be subject to the terms and conditions of\n",
            "__________________________________________________ Generating new training data example 4/20 for prompt : <investments contract clause summary: (i) existing on the Closing D> __________________________________________________\n",
            "Generated : \n",
            "  investments contract clause summary: (i) existing on the Closing Determination date, and (ii) the date of the Closing Date. (2) The\n",
            "__________________________________________________ Generating new training data example 5/20 for prompt : <investments contract clause summary: Neither the Borrower nor any > __________________________________________________\n",
            "Generated : \n",
            "  investments contract clause summary: Neither the Borrower nor any of the other parties to the contract has any obligation to pay any of its obligations to the B\n",
            "__________________________________________________ Generating new training data example 6/20 for prompt : <investments contract clause summary: The Customer will not, direct> __________________________________________________\n",
            "Generated : \n",
            "  investments contract clause summary: The Customer will not, direct or indirectly, receive any compensation or benefit from the sale of the Company's common stock, or any\n",
            "__________________________________________________ Generating new training data example 7/20 for prompt : <investments contract clause summary: The Borrower will not, nor wi> __________________________________________________\n",
            "Generated : \n",
            "  investments contract clause summary: The Borrower will not, nor wiht any other person, be liable for any loss or damage arising out of or in\n",
            "__________________________________________________ Generating new training data example 8/20 for prompt : <investments contract clause summary: The Borrower will not, nor wi> __________________________________________________\n",
            "Generated : \n",
            "  investments contract clause summary: The Borrower will not, nor wiht any other person, be liable for any loss or damage arising out of or in\n",
            "__________________________________________________ Generating new training data example 9/20 for prompt : <investments contract clause summary: The Borrower shall not, and s> __________________________________________________\n",
            "Generated : \n",
            "  investments contract clause summary: The Borrower shall not, and s/he shall not be required to, make any payment to the Borrowor for\n",
            "__________________________________________________ Generating new training data example 10/20 for prompt : <investments contract clause summary: Directly or indirectly acquir> __________________________________________________\n",
            "Generated : \n",
            "  investments contract clause summary: Directly or indirectly acquirably, the Company may acquire, lease, or otherwise dispose of any of its subsidiaries, affiliates,\n",
            "__________________________________________________ Generating new training data example 11/20 for prompt : <investments contract clause summary: The Borrower shall not, and s> __________________________________________________\n",
            "Generated : \n",
            "  investments contract clause summary: The Borrower shall not, and s/he shall not be required to, make any payment to the Borrowor for\n",
            "__________________________________________________ Generating new training data example 12/20 for prompt : <investments contract clause summary: From the date of this Agreeme> __________________________________________________\n",
            "Generated : \n",
            "  investments contract clause summary: From the date of this Agreeme, the Company has agreed to pay $1.5 million to the United States Department of\n",
            "__________________________________________________ Generating new training data example 13/20 for prompt : <investments contract clause summary: The Borrower will not, nor wi> __________________________________________________\n",
            "Generated : \n",
            "  investments contract clause summary: The Borrower will not, nor wiht any other person, be liable for any loss or damage arising out of or in\n",
            "__________________________________________________ Generating new training data example 14/20 for prompt : <investments contract clause summary: Save for those provided in th> __________________________________________________\n",
            "Generated : \n",
            "  investments contract clause summary: Save for those provided in thirteenth paragraph, the cost of the contract is $1,000,000.\n",
            "\n",
            "The\n",
            "__________________________________________________ Generating new training data example 15/20 for prompt : <investments contract clause summary: No more than 45% of the “valu> __________________________________________________\n",
            "Generated : \n",
            "  investments contract clause summary: No more than 45% of thevaluables\n",
            "\n",
            "valuable\n",
            "\n",
            "A function that takes a list of values\n",
            "__________________________________________________ Generating new training data example 16/20 for prompt : <investments contract clause summary: The Borrower shall not make o> __________________________________________________\n",
            "Generated : \n",
            "  investments contract clause summary: The Borrower shall not make omissions or omissions that would have been made by the borrower if the borrower had not made\n",
            "__________________________________________________ Generating new training data example 17/20 for prompt : <investments contract clause summary: Except for Permitted Investme> __________________________________________________\n",
            "Generated : \n",
            "  investments contract clause summary: Except for Permitted Investme, the following are permitted investments:\n",
            "\n",
            "1. The purchase of a common stock or a common\n",
            "__________________________________________________ Generating new training data example 18/20 for prompt : <investments contract clause summary: (i) Other than in accordance > __________________________________________________\n",
            "Generated : \n",
            "  investments contract clause summary: (i) Other than in accordance with paragraph (2)(b), the Secretary shall not enter into any contract with any entity that\n",
            "__________________________________________________ Generating new training data example 19/20 for prompt : <investments contract clause summary: Make any Investments other th> __________________________________________________\n",
            "Generated : \n",
            "  investments contract clause summary: Make any Investments other thier Terms of Use, including the Terms of Service, and any other terms of the Agreement, and make\n",
            "__________________________________________________ Generating new training data example 20/20 for prompt : <investments contract clause summary: The Subadviser is hereby auth> __________________________________________________\n",
            "Generated : \n",
            "  investments contract clause summary: The Subadviser is hereby authured to provide the Subadvisor with the following information: (1) The name,\n",
            "__________________________________________________ Generating for label=base-salary No. 4/4 __________________________________________________\n",
            "__________________________________________________ Generating new training data example 1/20 for prompt : <base-salary contract clause summary: A base annual salary equal to> __________________________________________________\n",
            "Generated : \n",
            "  base-salary contract clause summary: A base annual salary equal to the base salary of the player who is the team's starting pitcher.\n",
            "\n",
            "A\n",
            "__________________________________________________ Generating new training data example 2/20 for prompt : <base-salary contract clause summary: The Company shall pay to Exec> __________________________________________________\n",
            "Generated : \n",
            "  base-salary contract clause summary: The Company shall pay to Executives and Directors of the Company a minimum of $1,000,000 per\n",
            "__________________________________________________ Generating new training data example 3/20 for prompt : <base-salary contract clause summary: Effective July 1, 1997, the b> __________________________________________________\n",
            "Generated : \n",
            "  base-salary contract clause summary: Effective July 1, 1997, the bursary contract shall be paid to the employee for the period of the employee\n",
            "__________________________________________________ Generating new training data example 4/20 for prompt : <base-salary contract clause summary: Executive shall be paid a bas> __________________________________________________\n",
            "Generated : \n",
            "  base-salary contract clause summary: Executive shall be paid a basestar salary of $1,000,000.\n",
            "\n",
            "Executive shall be compensated\n",
            "__________________________________________________ Generating new training data example 5/20 for prompt : <base-salary contract clause summary: Subject to the terms and cond> __________________________________________________\n",
            "Generated : \n",
            "  base-salary contract clause summary: Subject to the terms and conditions of the contract, the employee shall be entitled to receive a salary equal to the\n",
            "__________________________________________________ Generating new training data example 6/20 for prompt : <base-salary contract clause summary: For all services rendered und> __________________________________________________\n",
            "Generated : \n",
            "  base-salary contract clause summary: For all services rendered undeliverable, the employee shall be entitled to a full-time, full-term\n",
            "__________________________________________________ Generating new training data example 7/20 for prompt : <base-salary contract clause summary: During the Employment Term, f> __________________________________________________\n",
            "Generated : \n",
            "  base-salary contract clause summary: During the Employment Term, furloughs are not considered to be a form of pay restraint.\n",
            "\n",
            "The\n",
            "__________________________________________________ Generating new training data example 8/20 for prompt : <base-salary contract clause summary: As compensation for services > __________________________________________________\n",
            "Generated : \n",
            "  base-salary contract clause summary: As compensation for services rendered, the employee shall be paid a salary equal to the employee's annual salary plus the employee\n",
            "__________________________________________________ Generating new training data example 9/20 for prompt : <base-salary contract clause summary: In consideration of the Execu> __________________________________________________\n",
            "Generated : \n",
            "  base-salary contract clause summary: In consideration of the Execu- sion of the contract, the Company shall pay the Company $1,000\n",
            "__________________________________________________ Generating new training data example 10/20 for prompt : <base-salary contract clause summary: During the Term, the Company > __________________________________________________\n",
            "Generated : \n",
            "  base-salary contract clause summary: During the Term, the Company will pay the Company $1,000,000 for each of the following: (\n",
            "__________________________________________________ Generating new training data example 11/20 for prompt : <base-salary contract clause summary: The Company shall pay to the > __________________________________________________\n",
            "Generated : \n",
            "  base-salary contract clause summary: The Company shall pay to the employee a salary equal to the amount of the employee's salary plus the employee-sal\n",
            "__________________________________________________ Generating new training data example 12/20 for prompt : <base-salary contract clause summary: During the Term, Executive wi> __________________________________________________\n",
            "Generated : \n",
            "  base-salary contract clause summary: During the Term, Executive wi-fi service is provided to the Executive Director of the Office of Management and Budget.\n",
            "__________________________________________________ Generating new training data example 13/20 for prompt : <base-salary contract clause summary: The Company shall pay Executi> __________________________________________________\n",
            "Generated : \n",
            "  base-salary contract clause summary: The Company shall pay Executi- tional and other compensation to the Company for the services rendered by the Company in\n",
            "__________________________________________________ Generating new training data example 14/20 for prompt : <base-salary contract clause summary: As compensation for the servi> __________________________________________________\n",
            "Generated : \n",
            "  base-salary contract clause summary: As compensation for the servi- tation of the United States, the Secretary of Labor shall provide for the payment\n",
            "__________________________________________________ Generating new training data example 15/20 for prompt : <base-salary contract clause summary: For performance of services u> __________________________________________________\n",
            "Generated : \n",
            "  base-salary contract clause summary: For performance of services u.s.s., the employee shall be paid a salary equal to the employee's annual\n",
            "__________________________________________________ Generating new training data example 16/20 for prompt : <base-salary contract clause summary: During the Employment Period,> __________________________________________________\n",
            "Generated : \n",
            "  base-salary contract clause summary: During the Employment Period, the employer shall provide the employee with a written notice of termination of employment. The notice shall\n",
            "__________________________________________________ Generating new training data example 17/20 for prompt : <base-salary contract clause summary: During the period of this Agr> __________________________________________________\n",
            "Generated : \n",
            "  base-salary contract clause summary: During the period of this Agrarian Contract, the Secretary shall provide to the Secretary a summary of the amount of\n",
            "__________________________________________________ Generating new training data example 18/20 for prompt : <base-salary contract clause summary: Employer shall pay to Executi> __________________________________________________\n",
            "Generated : \n",
            "  base-salary contract clause summary: Employer shall pay to Executi- tion the following: (1) The amount of the salary and benefits paid\n",
            "__________________________________________________ Generating new training data example 19/20 for prompt : <base-salary contract clause summary: During the Term, the Company > __________________________________________________\n",
            "Generated : \n",
            "  base-salary contract clause summary: During the Term, the Company will pay the Company $1,000,000 for each of the following: (\n",
            "__________________________________________________ Generating new training data example 20/20 for prompt : <base-salary contract clause summary: During the term of Executive’> __________________________________________________\n",
            "Generated : \n",
            "  base-salary contract clause summary: During the term of ExecutiveThe first time I saw the new \"Star Wars\" movie, I was so excited\n",
            "Augmented Training Dataset Size = 160\n",
            "Training Augmented Classifierr\n",
            "sent_small_bert_L2_128 download started this may take some time.\n",
            "Approximate size to download 16.1 MB\n",
            "[OK!]\n",
            "Training Settings : Epochs=5, learn_rate=0.0005, batch_size=32, dropout=0.5\n",
            "sentence_detector_dl download started this may take some time.\n",
            "Approximate size to download 354.6 KB\n",
            "[OK!]\n",
            "__________________________________________________ Metrics on vanilla Train datataset with AUGMENTED Model __________________________________________________\n",
            "              precision    recall  f1-score   support\n",
            "\n",
            " base-salary       0.49      1.00      0.65        55\n",
            "    interest       0.00      0.00      0.00        44\n",
            " investments       0.89      0.55      0.68        29\n",
            "       loans       0.84      0.79      0.81        47\n",
            "\n",
            "    accuracy                           0.62       175\n",
            "   macro avg       0.55      0.58      0.54       175\n",
            "weighted avg       0.53      0.62      0.54       175\n",
            "\n",
            "____________________________________________________________________________________________________\n",
            "__________________________________________________ Metrics on vanilla Train datataset with VANILLA Model __________________________________________________\n",
            "              precision    recall  f1-score   support\n",
            "\n",
            " base-salary       0.45      0.99      0.62       180\n",
            "    interest       1.00      0.01      0.02       180\n",
            " investments       0.52      0.93      0.67       180\n",
            "       loans       1.00      0.01      0.02       180\n",
            "\n",
            "    accuracy                           0.49       720\n",
            "   macro avg       0.74      0.49      0.33       720\n",
            "weighted avg       0.74      0.49      0.33       720\n",
            "\n",
            "____________________________________________________________________________________________________\n",
            "__________________________________________________ Metrics on Test datataset with AUGMENTED Model __________________________________________________\n",
            "              precision    recall  f1-score   support\n",
            "\n",
            " base-salary       0.51      1.00      0.68        79\n",
            "    interest       0.33      0.01      0.03        71\n",
            " investments       0.55      0.67      0.60        52\n",
            "       loans       0.80      0.60      0.69        75\n",
            "\n",
            "    accuracy                           0.58       277\n",
            "   macro avg       0.55      0.57      0.50       277\n",
            "weighted avg       0.55      0.58      0.50       277\n",
            "\n",
            "____________________________________________________________________________________________________\n",
            "__________________________________________________ Metrics on Test datataset with VANILLA Model __________________________________________________\n",
            "              precision    recall  f1-score   support\n",
            "\n",
            " base-salary       0.48      1.00      0.65        55\n",
            "    interest       1.00      0.02      0.04        44\n",
            " investments       0.32      0.66      0.43        29\n",
            "       loans       1.00      0.02      0.04        47\n",
            "\n",
            "    accuracy                           0.43       175\n",
            "   macro avg       0.70      0.42      0.29       175\n",
            "weighted avg       0.72      0.43      0.30       175\n",
            "\n",
            "____________________________________________________________________________________________________\n",
            "__________________________________________________ Metrics on Augmented Train datataset with AUGMENTED Model __________________________________________________\n",
            "              precision    recall  f1-score   support\n",
            "\n",
            " base-salary       0.46      1.00      0.63       180\n",
            "    interest       0.50      0.01      0.01       180\n",
            " investments       0.71      0.84      0.77       180\n",
            "       loans       0.68      0.42      0.52       180\n",
            "\n",
            "    accuracy                           0.57       720\n",
            "   macro avg       0.59      0.57      0.48       720\n",
            "weighted avg       0.59      0.57      0.48       720\n",
            "\n",
            "______________________________________________________________________________________________________________________________________________________\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# There are many more models you can put to use in 1 line of code!\n",
        "## Checkout [the Modelshub](https://nlp.johnsnowlabs.com/models) and the [NLU Namespace](https://nlu.johnsnowlabs.com/docs/en/spellbook) for more models\n",
        "\n",
        "### NLU Webinars and Video Tutorials\n",
        "- [NLU & Streamlit Tutorial](https://vimeo.com/579508034#)\n",
        "- [Crash course of the 50 + Medical Domains and the 200+ Healtchare models in NLU](https://www.youtube.com/watch?v=gGDsZXt1SF8)\n",
        "- [Multi Lingual NLU Webinar - Tutorial on Chinese News dataset](https://www.youtube.com/watch?v=ftAOqJuxnV4)\n",
        "- [John Snow Labs NLU: Become a Data Science Superhero with One Line of Python code](https://events.johnsnowlabs.com/john-snow-labs-nlu-become-a-data-science-superhero-with-one-line-of-python-code?hsCtaTracking=c659363c-2188-4c86-945f-5cfb7b42fcfc%7C8b2b188b-92a3-48ba-ad7e-073b384425b0)\n",
        "- [Python Web Def Conf - Python's NLU library: 1,000+ Models, 200+ Languages, State of the Art Accuracy, 1 Line of Code](https://2021.pythonwebconf.com/presentations/john-snow-labs-nlu-the-simplicity-of-python-the-power-of-spark-nlp)\n",
        "- [NYC/DC NLP Meetup with NLU](https://youtu.be/hJR9m3NYnwk?t=2155)\n",
        "\n",
        "### More ressources \n",
        "- [Join our Slack](https://join.slack.com/t/spark-nlp/shared_invite/zt-lutct9gm-kuUazcyFKhuGY3_0AMkxqA)\n",
        "- [NLU Website](https://nlu.johnsnowlabs.com/)\n",
        "- [NLU Github](https://github.com/JohnSnowLabs/nlu)\n",
        "- [Many more NLU example tutorials](https://github.com/JohnSnowLabs/nlu/tree/master/examples)\n",
        "- [Overview of every powerful nlu 1-liner](https://nlu.johnsnowlabs.com/docs/en/examples)\n",
        "- [Checkout the Modelshub for an overview of all models](https://nlp.johnsnowlabs.com/models) \n",
        "- [Checkout the NLU Namespace where you can find every model as a tabel](https://nlu.johnsnowlabs.com/docs/en/spellbook)\n",
        "- [Intro to NLU article](https://medium.com/spark-nlp/1-line-of-code-350-nlp-models-with-john-snow-labs-nlu-in-python-2f1c55bba619)\n",
        "- [Indepth and easy Sentence Similarity Tutorial, with StackOverflow Questions using BERTology embeddings](https://medium.com/spark-nlp/easy-sentence-similarity-with-bert-sentence-embeddings-using-john-snow-labs-nlu-ea078deb6ebf)\n",
        "- [1 line of Python code for BERT, ALBERT, ELMO, ELECTRA, XLNET, GLOVE, Part of Speech with NLU and t-SNE](https://medium.com/spark-nlp/1-line-of-code-for-bert-albert-elmo-electra-xlnet-glove-part-of-speech-with-nlu-and-t-sne-9ebcd5379cd)"
      ],
      "metadata": {
        "id": "LVEih75yxtw-",
        "pycharm": {
          "name": "#%% md\n"
        }
      }
    },
    {
      "cell_type": "code",
      "source": [],
      "metadata": {
        "id": "MbVDVR8-xuE0",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "execution_count": null,
      "outputs": []
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "collapsed_sections": [],
      "provenance": [],
      "toc_visible": true
    },
    "gpuClass": "standard",
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}